This page documents the data cleaning and preparation procedures, variable selection, statistical modeling, and survival probability calibration used in the following research articles: 1) A Design of Experiment Approach for Improving Performance of Machine Learning Algorithms in Predicting Survival of Transplant Surgeries; and 2) A Two-Stage Machine Learning Approach to Predict Heart Transplantation Survival Probabilities over Time with a Monotonic Probability Constraint.
The code and functions are available in the study's GitHub repository: https://github.com/transplantation/unos-ht
The data were provided from the UNOS (United Network for Organ Sharing) registry by UNOS staff in the US.
The reader can show any code chunk by clicking on the Code button. We hide the code by default because we (a) wanted to improve the readability of this document and (b) assumed that readers will not be interested in reading every code chunk.
Figure 1 provides an overview of our two-stage framework for obtaining monotonic survival probabilities over time.
Figure 1: The proposed two-stage framework for obtaining monotonically decreasing heart transplantation survival probabilities.
The snippet below documents the list of R packages and functions that were used in this research. For convenience, we used the pacman package since it allows for installing/loading the needed packages in one step.
rm(list = ls()) # clear global environment
graphics.off() # close all graphics
library(pacman) # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
p_load(haven,dplyr,caret,foreign,glmnet,
lubridate,dataPreparation,httr, DT, stringr, AUC, snow, testit,caretEnsemble,
C50, Biocomb, varImp, party,
randomForest,
kernlab,gower,
e1071,Boruta,ROSE,DMwR,
unmarked) # unmarked is for downloading and reading an .rds file.
# UNOS data are loaded from a local drive; the data can be obtained from www.unos.org
data.path <- "C:\\Users\\hahadydolatsara\\OneDrive - Auburn University\\Transplant\\BUAL-LAB\\DoE_10years\\"
source("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/functions.R")
In the snippet below, we load the data file and the form that records information about the variables in the data. Both files are provided by UNOS, but we added one column, INTERPRETATION_TYPE, to the information form. It records the variable type for each variable in the data; we created it based on the code book provided by UNOS. The information can be found here. The data dictionary can be found here.
# load data set
heart.df <- read_sas(paste0(data.path,"thoracic_data.sas7bdat"))
# add ids for each row
heart.df$ID <- row.names(heart.df)
# heart.form contains the variable definitions of each variable in data
heart.form <- read.csv(paste0(data.path, "Interpretation_type.csv"))
Here is the information included in the form.
Since our focus is to study long-term survival prediction for adult patients after a heart transplant is performed, we drop some variables and observations based on the following criteria:
## In the following lines, we identify variables that are no longer used after 2000 or that were added after 2000.
# First, we remove whitespace from the start and end of each element of the variable VAR.END.DATE.
var.end.dates <- trimws(heart.form$VAR.END.DATE)
# Second, identify the variables that are still in use.
names.of.var.did.not.end <- heart.form[which(var.end.dates==""), 1]
# Make sure ID is included.
names.of.var.did.not.end <- c(as.character(names.of.var.did.not.end), "ID")
# Third, identify the variables that are not used now.
vars_ended <- heart.form[which(heart.form$VARIABLE.NAME %in% names(heart.df)[!(names(heart.df) %in% names.of.var.did.not.end)]),(1:2)]
# Figure out which date each variable was added
vars.added.dates <- heart.form$VAR.START.DATE %>% as.character.Date()
# Fix an error in the information form: two dates corresponded to one variable; we chose the later date.
vars.added.dates[which(vars.added.dates=="01-Oct-87, 01-Oct-90")] <- "01-Oct-90"
# Figure out which year each variable was added and add this information to the heart form
heart.form$YR_ADDED <- sapply(vars.added.dates, function(x) str_extract_all(x,"[0-9]{1,2}")[[1]][2]) %>% as.integer()
# Variables added before 2000 (two-digit year >= 87) that are still in use now
vars.added.before.2000 <- subset(heart.form, YR_ADDED>=87, select = c(1))
vars.added.NA <- subset(heart.form,is.na(YR_ADDED),select = c(1))
vars.added.all <- rbind(vars.added.before.2000,vars.added.NA)
vars.added.all <- vars.added.all[["VARIABLE.NAME"]] %>%
as.character()
vars.added.all <- c(as.character(vars.added.all), "ID")
# Based on the criteria, we subset the data: we drop variables that were discontinued and exclude patients who were under 18, too light, too short, or who did not have a heart transplant.
heart.df.cleaned <- subset(heart.df, WL_ORG=="HR") %>% # Heart
subset(AGE>=18) %>% # Adults only
# we exclude extremely light or short donors and recipients (below the 0.01th percentile)
subset(WGT_KG_DON_CALC >= quantile(WGT_KG_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(WGT_KG_TCR >= quantile(WGT_KG_TCR, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_DON_CALC >= quantile(HGT_CM_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_TCR >= quantile(HGT_CM_TCR, 0.0001, na.rm = TRUE)) %>%
subset(select=intersect(names.of.var.did.not.end,vars.added.all))
# Find variables that are related to dates
vars_discarded <- heart.form %>%
subset(INTERPRETATION_TYPE=="D", select=c(1,2))
heart.discard <- vars_discarded$VARIABLE.NAME %>% as.character()
# Identify the variables that are related to dates in the data and remove them
heart.discard <- intersect(heart.discard, colnames(heart.df.cleaned))
heart.df.cleaned <- select(heart.df.cleaned, -heart.discard)
In this section, we create several variables based on the following references:
1. Medved, Dennis, et al. "Improving prediction of heart transplantation outcome using deep learning techniques." Scientific Reports 8.1 (2018): 3613.
2. Dag, Ali, et al. "Predicting heart transplantation outcomes through data analytics." Decision Support Systems 94 (2017): 42-52.
# ref1: Medved, Dennis, et al. "Improving prediction of heart transplantation outcome using deep learning techniques." Scientific reports 8.1 (2018): 3613.
# refer to a tool provided in the: http://ihtsa.cs.lth.se/ , which is product of this paper:
# https://www.nature.com/articles/s41598-018-21417-7.pdf
# ref2: Dag, Ali, et al. "Predicting heart transplantation outcomes through data analytics." Decision Support Systems 94 (2017): 42-52.
# create several variables based on references and obtain the subset of the data based on the criteria
heart.df.cleaned <- subset(heart.df.cleaned) %>%
#ref1
mutate(PVR = (HEMO_PA_MN_TRR- HEMO_PCW_TRR)*79.72/HEMO_CO_TRR) %>%
#ref1
mutate(ISCHTIME = ISCHTIME*60) %>%
#ref1
mutate(ECMO = ifelse(ECMO_TCR + ECMO_TRR == 0, 0, 1)) %>%
# PVR, pulmonary vascular resistance / its calculation is based on the below mentioned links:
# https://en.wikipedia.org/wiki/Vascular_resistance
# http://www.scymed.com/en/smnxph/phkhr013.htm
# https://radiopaedia.org/articles/mean-pulmonary-arterial-pressure (calculation of Mean Pulmonary Arterial Pressure)
# PVR= (Mean Pulmonary Arterial Pressure (mmHg) - Pulmonary Capillary Wedge Pressure (mmHg)) * 79.72 / Cardiac Output (L/min)
# PVR = (HEMO_PA_MN_TRR - HEMO_PCW_TRR)* 79.72 / HEMO_CO_TRR
# ECMO / merge of (ECMO_TCR, ECMO_TRR)
# The following variables are mutated by the authors
mutate(BMI_CHNG = 100*(BMI_CALC- INIT_BMI_CALC)/INIT_BMI_CALC) %>%
# mutate(WAITING_TIME = TX_DATE - INIT_DATE) %>% #no need, because it is already in there as "DAYSWAIT_CHRON"
mutate(WGT_CHNG = 100*(WGT_KG_CALC - INIT_WGT_KG_CALC)/INIT_WGT_KG_CALC) %>%
mutate(HGT_CHNG = 100*(HGT_CM_CALC - INIT_HGT_CM_CALC)/INIT_HGT_CM_CALC) %>%
mutate(AGE_MAT = abs(AGE - AGE_DON)) %>%
mutate(BMI_MAT = abs(BMI_CALC - BMI_DON_CALC))
In this section, we re-group levels of some categorical variables and develop some categorical variables based on the literature, considering the pool of patients.
The function we use to re-group levels of a categorical variable is cat_changer(), which takes four arguments. Its usage can be found in functions.R in the study's GitHub repository. At the end of this snippet, we remove two variables (DEATH_CIRCUM_DON and DEATH_MECH_DON) due to a discrepancy.
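For readers who do not want to open functions.R, the sketch below shows one way such a re-grouping helper could be written in base R. This is an illustrative assumption, not the study's actual implementation; the real cat_changer() may handle factors and edge cases differently.

```r
# Hypothetical sketch of a level re-grouping helper (the real cat_changer()
# is defined in functions.R and may differ): values of `var` found in
# val_old are replaced by the corresponding entry of val_new; all other
# values, including NA, are left unchanged.
cat_changer <- function(df, var, val_old, val_new) {
  x <- as.character(df[[var]])
  idx <- match(x, as.character(val_old))
  df[[var]] <- ifelse(is.na(idx), x, val_new[idx])
  df
}
```

For example, cat_changer(heart.df.cleaned, var = "DIAB", val_old = c(1, 2), val_new = c("N", "ONE")) would map levels 1 and 2 of DIAB to "N" and "ONE".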
# We regroup Diagnosis variables (three variables: DIAG, TCR_DGN, THORACIC_DGN) as follows
val_old <- c(1000,1001,1002,1003,1004,1005,1006,1049,1007,1200,999,1999)
val_new <- c("DILATED_MYOPATHY_IDI","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_ISC","CORONARY","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="DIAG",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="TCR_DGN",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="THORACIC_DGN",val_old,val_new)
# The variable: COD_CAD_DON is re-grouped
val_old <- c(1,2,3,4,999,"Unknown")
val_new <- c("ANOXIA","CEREBROVASCULAR_STROKE","HEAD_TRAUMA","OTHER","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="COD_CAD_DON",val_old,val_new)
# The variables regarding blood types are re-grouped.
val_old <- c("A","A1","A2","B","O","AB","A1B","A2B")
val_new <- c("A","A","A","B","O","AB","AB","AB")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ABO",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ABO_DON",val_old,val_new)
# The variable: DIAB is re-grouped.
val_old <- c(1,2,3,4,5,998)
val_new<-c("N","ONE","TWO","OTHER","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="DIAB",val_old,val_new)
# Previous cardiac surgery: merge of (PRIOR_CARD_SURG_TCR, PRIOR_CARD_SURG_TRR).
# CARD_SURG is "Y" if either source reports "Y", "N" if both report "N", and NA otherwise.
heart.df.cleaned$CARD_SURG <- NA
for (i in 1:nrow(heart.df.cleaned)) {
  if (!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TCR[i])) {
    if (heart.df.cleaned$PRIOR_CARD_SURG_TCR[i] == "Y") {
      heart.df.cleaned$CARD_SURG[i] <- "Y"
    }
    if (heart.df.cleaned$PRIOR_CARD_SURG_TCR[i] == "N") {
      if (!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i])) {
        if (heart.df.cleaned$PRIOR_CARD_SURG_TRR[i] == "N") {
          heart.df.cleaned$CARD_SURG[i] <- "N"
        }
      }
    }
  }
  if (!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i])) {
    if (heart.df.cleaned$PRIOR_CARD_SURG_TRR[i] == "Y") {
      heart.df.cleaned$CARD_SURG[i] <- "Y"
    }
  }
}
# In the data set, several variables are related to different types of antigen alleles. Each antigen has two allele types, so these variables record 0: no mismatch, 1: one mismatched, 2: both mismatched. Instead of using these variables, we use the variables that record summaries of matches in these antigen alleles (HLAMIS, AMIS, BMIS, DRMIS) and remove the following variables:
heart.df.cleaned[c("DA1","DA2","RA1","RA2","DB1","DB2","RB1","RB2","RDR1","RDR2","DDR1","DDR2")] <- NULL
# reference for HLAMIS
# Weisdorf, Daniel, et al. "Classification of HLA-matching for retrospective analysis of unrelated donor transplantation: revised definitions to predict survival." Biology of Blood and Marrow Transplantation 14.7 (2008): 748-758.
# references for HLAMIS, AMIS, BMIS, DRMIS:
# Parham, Peter. The immune system. Garland Science, 2014 (PAGE 446).
val_old <- 0:6
val_new <- c("LOWEST","LOWEST","LOWEST","LOW","MEDIUM","HIGH","HIGHEST")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HLAMIS",val_old,val_new)
# The following block contains our variable manipulation based on other literature, as specified in Dag et al.'s paper.
heart.df.cleaned$ETH_MAT <- NA
for (i in 1:nrow(heart.df.cleaned)) {
  if (!is.na(heart.df.cleaned$ETHCAT[i])) {
    if (!is.na(heart.df.cleaned$ETHCAT_DON[i])) {
      if (heart.df.cleaned$ETHCAT_DON[i] == heart.df.cleaned$ETHCAT[i]) {
        heart.df.cleaned$ETH_MAT[i] <- "Y"
      } else {
        heart.df.cleaned$ETH_MAT[i] <- "N"
      }
    }
  }
}
heart.df.cleaned$GENDER_MAT <- NA
for (i in 1:nrow(heart.df.cleaned)) {
  if (!is.na(heart.df.cleaned$GENDER[i])) {
    if (!is.na(heart.df.cleaned$GENDER_DON[i])) {
      if (heart.df.cleaned$GENDER[i] == heart.df.cleaned$GENDER_DON[i]) {
        heart.df.cleaned$GENDER_MAT[i] <- "Y"
      } else {
        heart.df.cleaned$GENDER_MAT[i] <- "N"
      }
    }
  }
}
# PROC_TY_HR, from literature
val_old <- c(1,2)
val_new <- c("BICAVAL","TRADITIONAL")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PROC_TY_HR",val_old,val_new)
# The variable: SHARE_TY is regrouped.
# ALLOCATION TYPE-LOCAL/REGIONAL/NATIONAL - 3=LOCAL/4=REGIONAL/5=NATIONAL/6=FOREIGN
val_old <- c(3,4,5)
val_new <- c("LOCAL","REGIONAL","NATIONAL")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="SHARE_TY",val_old,val_new)
# The variable: EDUCATION is regrouped.
val_old <- c(1,2,3,4,5,6,996,998)
val_new <- c("OTHER","GRADE","HIGH","COLLEGE","UNIVERSITY","UNIVERSITY","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="EDUCATION",val_old,val_new)
# The ethnicity variables ETHCAT (recipient) and ETHCAT_DON (donor) are re-grouped
val_old <- c(1,2,4,5,6,7,9,998)
val_new <- c("WHITE","BLACK","HISPANIC","OTHER","OTHER","OTHER","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ETHCAT",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ETHCAT_DON",val_old,val_new)
# We drop PRI_PAYMENT_CTRY_TCR and PRI_PAYMENT_CTRY_TRR because of too many NAs
heart.df.cleaned[c("PRI_PAYMENT_CTRY_TCR","PRI_PAYMENT_CTRY_TRR")] <- NULL
# Two variables: PRI_PAYMENT_TCR and PRI_PAYMENT_TRR are re-grouped.
val_old <- seq(1,15)
val_new<-c("PRIVATE","MEDICAID","MEDICARE_FREE","PUBLIC_OTHER","PUBLIC_OTHER","PUBLIC_OTHER","PUBLIC_OTHER","OTHER","OTHER","OTHER","OTHER","OTHER","PUBLIC_OTHER","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRI_PAYMENT_TCR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRI_PAYMENT_TRR",val_old,val_new)
# The variable: REGION is re-grouped.
val_old <- seq(1,11)
val_new <- c("NOTH_EAST","NOTH_EAST","SOUTH_EAST","SOUTH_EAST","WEST","WEST","MIDWEST","MIDWEST","NOTH_EAST","MIDWEST","SOUTH_EAST")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="REGION",val_old,val_new)
# here we re-group two variables: FUNC_STAT_TRR and FUNC_STAT_TCR, based on their activity level and hospitalization status
# https://www.communitycarenc.org/media/tool-resource-files/what-does-it-take-qualify-personal-care-services-d.pdf
# http://www.npcrc.org/files/news/karnofsky_performance_scale.pdf
val_old <- c(1,2,3,996,998,2010,2020,2030,2040,2050,2060,2070,2080,2090,2100)
val_new <- c("ABLE","ASSISTED","DISABLE","OTHER","UNKNOWN","DISABLE","DISABLE","DISABLE","DISABLE","ASSISTED","ASSISTED","ASSISTED","ABLE","ABLE","ABLE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="FUNC_STAT_TRR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="FUNC_STAT_TCR",val_old,val_new)
# Due to a discrepancy, we drop these 2 variables; in particular, the frequencies of natural causes of death are not consistent
heart.df.cleaned[c("DEATH_CIRCUM_DON","DEATH_MECH_DON")] <- NULL
In this section, we re-group levels of some categorical variables and develop some categorical variables based on their definitions in the code book provided by UNOS and their corresponding distributions. In addition, we drop variables with more than 90% of their observations missing, numerical variables with more than 30% missing, and variables with more than 90% of their observations in the same level (category).
# We drop variables that are related to identifiers or dates or not related to the goal of the study.
# We drop the malignancy variables (MALIG_TY, MALIG_TY_TCR) because the category levels are not well distinguished and there are too many NAs.
heart.df.cleaned[ c("WL_ID_CODE", "WL_ORG","INIT_DATE","TX_DATE","CTR_CODE","DATA_TRANSPLANT",
"DATA_WAITLIST","DISTANCE", "DON_RETYP","ECD_DONOR","END_OPO_CTR_CODE","HOME_STATE_DON",
"INIT_OPO_CTR_CODE", "LISTING_CTR_CODE","LOS",
"MALIG_TY","MALIG_TY_TCR","OPO_CTR_CODE","ORGAN","OTH_LIFE_SUP_OSTXT_TCR","OTH_LIFE_SUP_OSTXT_TRR",
"PERM_STATE","PRIOR_CARD_SURG_TYPE_OSTXT_TCR","PRIOR_CARD_SURG_TYPE_OSTXT_TRR","PT_CODE",
"TRR_ID_CODE")] <- NULL
# Nine variables: CMV_DON, EBV_SEROSTATUS, HBV_CORE, HBV_CORE_DON, HBV_SUR_ANTIGEN, HCV_SEROSTATUS, HEP_C_ANTI_DON, HTLV2_OLD_DON, HIV_SEROSTATUS are re-grouped.
val_old <- c("C","I","N","ND","P","PD","U")
val_new <- c("UNKNOWN","UNKNOWN","NEGATIVE","UNKNOWN","POSITIVE","UNKNOWN","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CMV_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="EBV_SEROSTATUS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_CORE",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_CORE_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_SUR_ANTIGEN",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HCV_SEROSTATUS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HEP_C_ANTI_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HTLV2_OLD_DON",val_old,val_new)
# In the code book, the variable HIV_SEROSTATUS does not specify the SAS ANALYSIS FORMAT; however, after checking its values we believe it also uses SERSTAT.
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HIV_SEROSTATUS",val_old,val_new)
# Here the NA-equivalent characters are changed to NA
NA_cells <- c("", " ", "U")
for (i in 1:length(NA_cells)) {
  heart.df.cleaned[heart.df.cleaned == NA_cells[i]] <- NA
  gc()
}
# Two variables: BRONCHO_LT_DON & BRONCHO_RT_DON are re-grouped.
val_old <- c(1,2,3,4,5,6,7,998)
val_new <- c("N","NORMAL","ABNORMAL","ABNORMAL","ABNORMAL","ABNORMAL","OTHER","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="BRONCHO_LT_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="BRONCHO_RT_DON",val_old,val_new)
# The variable: CHEST_XRAY_DON is re-grouped.
val_old <- c(0,1,2,3,4,5,998,999)
val_new <- c("UNKNOWN","N","NORMAL","ABNORMAL_SINGLE","ABNORMAL_SINGLE","ABNORMAL_BOTH","UNKNOWN","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CHEST_XRAY_DON",val_old,val_new)
# The variable: CORONARY_ANGIO is re-grouped.
val_old <- c(1,2,3)
val_new <- c("N", "Y_NORMAL","Y_ABNORMAL")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CORONARY_ANGIO",val_old,val_new)
# Two variable: HIST_DIABETES_DON and HYPERTENS_DUR_DON are re-grouped.
val_old <- c(1,2,3,4,5,998)
val_new <- c("N","Y","Y","Y","Y","UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HIST_DIABETES_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HYPERTENS_DUR_DON",val_old,val_new)
# Two variables: END_STAT, INIT_STAT are re-grouped.
# Although 2099 has a low frequency, we do not merge it, based on the literature: Huang, Edmund, et al. "Incidence of conversion to active waitlist status among temporarily inactive obese renal transplant candidates." Transplantation 98.2 (2014): 177-186.
val_old <- c(1010, 1020, 1030, 1090, 1999,2010,2020,2030,2090,2999,7010,7999)
val_new <- c("ONE","ONE","TWO","ONE","TEMPORARILY_INACTIVE","ONE","ONE","TWO","ONE","TEMPORARILY_INACTIVE","ACTIVE","TEMPORARILY_INACTIVE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="END_STAT",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="INIT_STAT",val_old,val_new)
# The variable: LAST_INACT_REASON is re-grouped
# We got its definition from: https://www.srtr.org/requesting-srtr-data/saf-data-dictionary/
# 1: Candidate cannot be contacted
# 2: Candidate choice
# 3: Candidate work-up incomplete
# 4: Insurance issues
# 5: Medical non-compliance
# 6: Inappropriate substance use
# 7: Temporarily too sick
# 8: Temporarily too well
# 9: Weight currently inappropriate for transplant
# 10: TX'ed - removal pending UNET data correction
# 11: Inactivation due to VAD implantation and/or VAD complication
# 12: TX Pending
# 13: Physician/Surgeon unavailable
# 14: Candidate for living donor transplant only
# We re-group this variable based on whether the candidate is inactive due to health issues
val_old <- seq(1,14)
val_new <- c("OTHER", "OTHER","OTHER","OTHER","OTHER","OTHER","HEALTH","OTHER","HEALTH","OTHER","HEALTH","OTHER","OTHER","OTHER")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="LAST_INACT_REASON",val_old,val_new)
# The variable: NUM_PREV_TX is re-grouped.
# Later we re-classify this variable from numerical to categorical.
val_old <- seq(0,10)
val_new <- c("ZERO", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="NUM_PREV_TX",val_old,val_new)
# Two variables: PRIOR_CARD_SURG_TYPE_TCR and PRIOR_CARD_SURG_TYPE_TRR are re-grouped.
#search for procurement in this form: All_Forms_eg_RH.pdf
# CABG: Coronary artery bypass graft
val_old <- seq(1,31)
val_new <- c("CABG_ONLY","VALVE_ONLY", "MULTIPLE","OTHER", "MULTIPLE","MULTIPLE", "MULTIPLE","OTHER", "MULTIPLE","MULTIPLE",
"MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE", "MULTIPLE","OTHER", "MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE",
"MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE", "MULTIPLE","MULTIPLE", "MULTIPLE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRIOR_CARD_SURG_TYPE_TCR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRIOR_CARD_SURG_TYPE_TRR",val_old,val_new)
# We map the following free-text medicines to "HEPARIN", "ANCEF", "DOPAMINE", and "ZOSYN" since these are the most repeated medicines.
temp <- heart.df.cleaned[c("PT_OTH1_OSTXT_DON", "PT_OTH2_OSTXT_DON","PT_OTH3_OSTXT_DON", "PT_OTH4_OSTXT_DON")]
new_vars <- data.frame(matrix(NA, ncol=4, nrow=nrow(heart.df.cleaned)))
colnames(new_vars) <- c("HEPARIN","ANCEF","DOPAMINE","ZOSYN")
temp <- as.data.frame(apply(temp, 2, function(x) gsub("^$| ^", NA, x)), stringsAsFactors=FALSE)
for (i in 1:ncol(new_vars)){
new_vars[,i] <- apply(temp, 1, function(x) detect_terms(x, colnames(new_vars)[i]))
}
# We remove PT_OTH1_OSTXT_DON, PT_OTH2_OSTXT_DON, PT_OTH3_OSTXT_DON, and PT_OTH4_OSTXT_DON and use the four variables we just created: HEPARIN, ANCEF, DOPAMINE, ZOSYN.
heart.df.cleaned[c("PT_OTH1_OSTXT_DON","PT_OTH2_OSTXT_DON","PT_OTH3_OSTXT_DON","PT_OTH4_OSTXT_DON")]<-NULL
heart.df.cleaned<-cbind(heart.df.cleaned,new_vars)
rm(list=c("new_vars")) #remove the unnecessary object
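detect_terms() is defined in functions.R. As an illustration of the idea (an assumption about its behavior, not the actual code), a row-wise term detector could look like:

```r
# Hypothetical sketch: returns "Y" if `term` occurs (case-insensitively)
# in any non-missing free-text cell of one row, and "N" otherwise.
detect_terms <- function(row_values, term) {
  vals <- row_values[!is.na(row_values)]
  if (any(grepl(term, vals, ignore.case = TRUE))) "Y" else "N"
}
```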
# The variable: STERNOTOMY_TRR is re-grouped. The definition is based on the pediatric TRR form.
val_old <- c(1,2,3,998)
val_new <- c("ONE", "MORE","MORE", "UNKNOWN")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="STERNOTOMY_TRR",val_old,val_new)
# We drop the variable: TX_YEAR from dataset since we want to have a more versatile model
heart.df.cleaned$TX_YEAR<-NULL
# The variable: ABO_MAT is re-grouped.
val_old <- c(1,2,3)
val_new <- c("IDENTICAL", "NOT_IDENTICAL", "NOT_IDENTICAL")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ABO_MAT",val_old,val_new)
# The variables: AMIS, BMIS, and DRMIS are re-grouped.
val_old <- c(0,1,2)
val_new <- c("NO_MISMATCH", "ONE_MISMATCHED", "TWO_MISMATCHED")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="AMIS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="BMIS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="DRMIS",val_old,val_new)
# The variables: INOTROPES_TCR, INOTROPES_TRR, OTHER_INF_DON, and PULM_INF_DON are re-grouped.
val_old <- c(0,1)
val_new <- c("N", "Y")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="INOTROPES_TCR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="INOTROPES_TRR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="OTHER_INF_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PULM_INF_DON",val_old,val_new)
# The variable: MED_COND_TRR is re-grouped.
val_old <- c(1,2,3)
val_new <- c("ICU_HOSPITALIZED", "HOSPITALIZED","NOT_HOSPITALIZED")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="MED_COND_TRR",val_old,val_new)
#===================================================================
# Here we drop columns in which more than 90% of the data are NA.
NA_Col_Rate <- col_missing_function(heart.df.cleaned)
NA_Col_Rate$varname <- rownames(NA_Col_Rate)
NA_Col_Rate <- NA_Col_Rate[which(NA_Col_Rate$na_count_col>0.9),]
NA_Col_Rate <- NA_Col_Rate$varname
heart.df.cleaned[NA_Col_Rate] <- NULL
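col_missing_function() also comes from functions.R. Based on how it is used here (row names are the variable names and na_count_col holds the NA proportion), a minimal base-R sketch could be:

```r
# Hypothetical sketch of col_missing_function(): the proportion of NA
# values per column, with variable names as row names.
col_missing_function <- function(df) {
  data.frame(na_count_col = colMeans(is.na(df)))
}
```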
#===================================================================
# Here we drop variables that have more than 90% of the observations in one level / category.
cat_dis <- vector(mode = "numeric", length = ncol(heart.df.cleaned))
for (i in 1:ncol(heart.df.cleaned)) {
  struct <- as.data.frame(table(heart.df.cleaned[i]))
  if (nrow(struct) > 0) {
    max_cat <- max(struct$Freq)
    all_freq <- sum(struct$Freq)
    if ((max_cat / all_freq) > 0.90) cat_dis[i] <- 1
  }
}
heart.df.cleaned[names(heart.df.cleaned[(cat_dis == 1)])] <- NULL
#===================================================================
# Here we define categorical and numerical variables
initial_num <- heart.form$VARIABLE.NAME[which(heart.form$INTERPRETATION_TYPE=="NUM")]
mutated_num <- c("BMI_CHNG" ,"WGT_CHNG" ,"HGT_CHNG" ,"AGE_MAT","BMI_MAT","PVR","ECMO")
pool_num <- c(as.character(initial_num),mutated_num)
# the numerical variables in the cleaned dataset are saved in pool_num_clean
pool_num_clean <- pool_num[which(pool_num %in% names(heart.df.cleaned))]
initial_char <- heart.form$VARIABLE.NAME[which(heart.form$INTERPRETATION_TYPE=="CHAR")]
mutated_char <- c("GENDER_MAT","ETH_MAT" ,"CARD_SURG" ,"HEPARIN" ,"ANCEF" ,"DOPAMINE","ZOSYN")
pool_char <- c(as.character(initial_char),mutated_char)
# the categorical variables in the cleaned dataset are saved in pool_char_clean
pool_char_clean <- pool_char[which(pool_char %in% names(heart.df.cleaned))]
#===================================================================
# Here we drop the numerical variables for which more than 30% of the observations are NA.
# We decided not to impute numerical values, so a more conservative approach is adopted.
NA_Col_Rate <- col_missing_function(heart.df.cleaned[pool_num_clean])
NA_Col_Rate$varname <- rownames(NA_Col_Rate)
NA_Col_Rate <- NA_Col_Rate[which(NA_Col_Rate$na_count_col>0.3),]
NA_Col_Rate <- NA_Col_Rate$varname
heart.df.cleaned[NA_Col_Rate] <- NULL
# changing category of "NUM_PREV_TX" from numerical to categorical
pool_num<-pool_num[!pool_num %in% "NUM_PREV_TX"]
pool_char<-c(pool_char,"NUM_PREV_TX")
#####################
# We updated numerical and categorical variables used here:
pool_num_clean <- pool_num[which(pool_num %in% names(heart.df.cleaned))]
pool_char_clean <- pool_char[which(pool_char %in% names(heart.df.cleaned))]
pool_num_clean_base<-pool_num_clean
pool_char_clean_base<-pool_char_clean
#removing the extra created data
rm(list = "temp")
For numerical variables, we either drop observations with missing values or impute them by the median (2 options); for categorical variables, we either recode missing values to an "UNKNOWN" level or drop the observations (2 options). In total, we have 2*2 = 4 different conditions.
## Section: Creating scenarios for Design of Experiments
### 1st scenario: NA values of categorical variables are set to UNKNOWN, and for numerical variables we drop the observation.
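The four conditions of this 2*2 design can be enumerated explicitly; the short sketch below (illustrative only; the condition names are our own) lays out the grid:

```r
# The 2 x 2 missing-data design: numerical NAs are either dropped or
# median-imputed; categorical NAs are either recoded to "UNKNOWN" or
# dropped via a heuristic row/column removal.
scenarios <- expand.grid(
  numerical   = c("drop", "median"),
  categorical = c("unknown", "drop"),
  stringsAsFactors = FALSE
)
nrow(scenarios) # 4 conditions
```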
# We drop the numerical NAs and replace the NAs (missing values) in categorical variables with "UNKNOWN".
dfs <- vector("list", 8) # initialize the container that will hold the scenario datasets
heart.df.cleaned.copy <- readRDS(paste0(data.path,"heart.df.cleaned.copy.rds"))
heart.df.cleaned <- heart.df.cleaned.copy
#dropping observations with numerical NAs
heart.df.cleaned <- heart.df.cleaned[complete.cases(heart.df.cleaned[pool_num_clean]),]
heart.df.cleaned_char <- heart.df.cleaned[pool_char_clean]
heart.df.cleaned_char[is.na(heart.df.cleaned_char)] <- "UNKNOWN"
heart.df.cleaned_num <- heart.df.cleaned[pool_num_clean]
df.un.dr<-cbind(heart.df.cleaned_char,heart.df.cleaned_num,heart.df.cleaned["ID"])
#Make sure the type of each variable is recorded correctly
for(i in names(df.un.dr)){
if(i %in% pool_char_clean){df.un.dr[i] <- as.character(df.un.dr[,i])}
if(i %in% pool_num_clean){df.un.dr[i] <- as.numeric(df.un.dr[,i])}
}
dfs[[1]]<-df.un.dr
# remove unnecessary datasets
rm(list=c("heart.df.cleaned_char","heart.df.cleaned_num","df.un.dr"))
#scenario1
Simply dropping every observation with a missing value is not practical, because we would lose all the observations. Instead, we maximize the number of remaining cells through a heuristic algorithm: in a loop, we drop the rows and columns with the most missing values until only non-empty cells remain.
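The heuristic is implemented in table_cleaner() in functions.R. The simplified sketch below illustrates the core idea under our assumptions (the real function takes further arguments and, among other things, protects columns such as ID and the target):

```r
# Hypothetical sketch of the greedy NA-removal idea behind table_cleaner():
# at each step, drop the single row or column with the highest fraction of
# NA cells, and repeat until no NA cells remain.
drop_emptiest <- function(df) {
  while (anyNA(df)) {
    row_na <- rowMeans(is.na(df))
    col_na <- colMeans(is.na(df))
    if (max(row_na) >= max(col_na)) {
      df <- df[-which.max(row_na), , drop = FALSE]
    } else {
      df <- df[, -which.max(col_na), drop = FALSE]
    }
  }
  df
}
```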
heart.df.cleaned.copy<-readRDS(paste0(data.path,"heart.df.cleaned.copy.rds"))
heart.df.cleaned<-heart.df.cleaned.copy
#dropping observations with NAs
obj_data <- table_cleaner(heart.df.cleaned,1,"ID","year1",c("KEEP_ALL"))
df.dr.dr<-obj_data$data
#Make sure the type of each variable is recorded correctly
for(i in names(df.dr.dr)){
if(i %in% pool_char_clean){df.dr.dr[i] <- as.character(df.dr.dr[,i])}
if(i %in% pool_num_clean){df.dr.dr[i] <- as.numeric(df.dr.dr[,i])}
}
dfs[[4]]<-df.dr.dr
# remove unnecessary datasets
rm(list=c("df.dr.dr"))
#scenario2
# The missing numerical values are imputed by the median, and the NAs (missing values) in categorical variables are replaced with "UNKNOWN"
heart.df.cleaned.copy<-readRDS(paste0(data.path,"heart.df.cleaned.copy.rds"))
heart.df.cleaned<-heart.df.cleaned.copy
heart.df.cleaned_num <- heart.df.cleaned[pool_num_clean]
num_med<-as.data.frame(apply(heart.df.cleaned_num,2,function(x) median(x,na.rm = TRUE) ))
for(j in pool_num_clean){
heart.df.cleaned_num[is.na(heart.df.cleaned_num[j]),j]<-num_med[j,1]
}
heart.df.cleaned_char <- heart.df.cleaned[pool_char_clean]
heart.df.cleaned_char[is.na(heart.df.cleaned_char)] <- "UNKNOWN"
df.un.med<-cbind(heart.df.cleaned_char,heart.df.cleaned_num,heart.df.cleaned["ID"])
#Make sure the type of each variable is recorded correctly
for(i in names(df.un.med)){
if(i %in% pool_char_clean){df.un.med[i] <- as.character(df.un.med[,i])}
if(i %in% pool_num_clean){df.un.med[i] <- as.numeric(df.un.med[,i])}
}
dfs[[5]]<-df.un.med
# remove unnecessary datasets
rm(list=c("heart.df.cleaned_char","heart.df.cleaned_num","df.un.med"))
#scenario3
If we drop every observation whose categorical variables contain NA, we lose all the observations. As before, we maximize the number of remaining cells through the heuristic algorithm that, in a loop, drops the rows and columns with the most missing values until only non-empty cells remain.
# The missing numerical values are imputed by the median; observations with missing categorical values are handled by the heuristic row/column dropping.
heart.df.cleaned.copy<-readRDS(paste0(data.path,"heart.df.cleaned.copy.rds"))
heart.df.cleaned<-heart.df.cleaned.copy
heart.df.cleaned_num <- heart.df.cleaned[pool_num_clean]
num_med<-as.data.frame(apply(heart.df.cleaned_num,2,function(x) median(x,na.rm = TRUE) ))
for(j in pool_num_clean){
heart.df.cleaned_num[is.na(heart.df.cleaned_num[j]),j]<-num_med[j,1]
}
heart.df.cleaned_char <- heart.df.cleaned[pool_char_clean]
df.dr.med<-cbind(heart.df.cleaned_char,heart.df.cleaned_num,heart.df.cleaned["ID"])
obj_data2 <- table_cleaner(df.dr.med,1,"ID","year1",c("KEEP_ALL"))
df.dr.med<-obj_data2$data
#Make sure the type of each variable is recorded correctly
for(i in names(df.dr.med)){
if(i %in% pool_char_clean){df.dr.med[i] <- as.character(df.dr.med[,i])}
if(i %in% pool_num_clean){df.dr.med[i] <- as.numeric(df.dr.med[,i])}
}
dfs[[8]]<-df.dr.med
# remove unnecessary datasets
rm(list=c("df.dr.med"))
#scenario4
rm(list=c("heart.df.cleaned_char","heart.df.cleaned_num",
"heart.df.cleaned.copy"))
In this snippet, we encode the derived data from each scenario in two ways: one-hot encoding and label encoding (4 scenarios x 2 encodings = 8 scenarios). We also identify the holdout IDs that separate the training and holdout sets, and five holdout ID sets for repeated cross-validation.
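encode_cat() is defined in functions.R. As an assumption about what its two modes do (label encoding as integers vs. keeping factors for later one-hot expansion), a minimal sketch is:

```r
# Hypothetical sketch of encode_cat(): "numeric" replaces each categorical
# column by integer labels; "factor" converts it to a factor, which
# modeling code can later expand into one-hot (dummy) columns.
encode_cat <- function(df, cat_vars, method = c("numeric", "factor")) {
  method <- match.arg(method)
  for (v in intersect(cat_vars, names(df))) {
    f <- factor(df[[v]])
    df[[v]] <- if (method == "numeric") as.integer(f) else f
  }
  df
}
```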
# We keep the IDs
## identifying holdout IDs from the cleaned data
## the holdout set is separated; on the rest we perform 5-fold cross-validation
idenx_holdout <- holdout_index(heart.df.cleaned$ID, 2019)
cat_vars <- setdiff(pool_char_clean, "year1")
## encoding (method = c("numeric", "factor"))
df_num <- lapply(dfs, function(x) encode_cat(x,cat_vars,"numeric"))
df_cat <- lapply(dfs, function(x) encode_cat(x,cat_vars,"factor"))
All_data <- c(df_num, df_cat)
## assign names for each component in the list
names(All_data) <- c(paste0("NUM_",1:4), paste0("CAT_",1:4))
saveRDS(All_data,paste0(data.path,"All_data.rds"))
## remove unused objects
rm(dfs,df_num,df_cat)
In our study, three feature selection methods are adopted: fast correlation-based feature selection (FFS), LASSO, and random forest. This section summarizes the results from these variable selection algorithms. To use the code provided here, you will need some experience with parallel computation using the parSapply() function in the R package snow. The random forest selection algorithm takes a long time to find the important features in the data, so this part was run on the Ohio Supercomputer Center. The code is designed to run for all of the timestamps; however, as mentioned in the paper, the DoE is performed for year 1 only.
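The parSapply() pattern used below can be reproduced with the base parallel package (the study used snow; the call structure is the same). The toy worker function here is a stand-in for select_vars(), purely for illustration:

```r
library(parallel)

cl <- makeCluster(2)                       # start two worker processes
# Apply a function over the five fold indices in parallel; any extra
# arguments (here a fixed offset) are passed after the function, just as
# the data, target, method, and seed arguments are passed to select_vars().
res <- parSapply(cl, 1:5, function(fold, offset) fold + offset, offset = 100)
stopCluster(cl)                            # always release the workers
res  # 101 102 103 104 105
```

Each fold's feature selection is independent of the others, which is why the work parallelizes cleanly across folds.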
# The data is loaded from local drive
All_data<-readRDS(paste0(data.path,"All_data.rds"))
# Here, we use the function parSapply() in snow package to perform the parallel computation for three variable selection algorithms
# for a definition of the feature selection algorithms, see, e.g., this paper:
# Fonti, Valeria, and Eduard Belitser. "Feature Selection using LASSO."
features <- rep(list(NA), 4)
names(features) <- c("FFS", "LASSO", "RF", "all")
impute_no <- 8 #from 1 to 8
#=============================================================
#==============Fast Feature selection
#=============================================================
features_FFS <- vector(mode="list", impute_no)
for (i in 1:impute_no){
cl <- makeCluster(4, type="SOCK")
features_FFS[[i]] <- parSapply(cl, 1:5, select_vars, All_data[[i]], "year1", "FFS", idenx_holdout, 2090)
stopCluster(cl)
}
#=============================================================
#==============Lasso Feature selection for Binomial TARGETS
#=============================================================
features_LASSO <- vector(mode="list", impute_no)
for (i in 1:impute_no){
cl <- makeCluster(4, type="SOCK")
features_LASSO[[i]] <- parSapply(cl, 1:5, select_vars, All_data[[i]], "year1", "LASSO", idenx_holdout, 2090)
stopCluster(cl)
}
save.image("C:/Users/hahadydolatsara/OneDrive - Auburn University/Transplant/BUAL-LAB/DoE/save_files/newsave4.RData")
#=============================================
#==============Random Forest Feature Selection
#=============================================
features_RF <- vector(mode="list", impute_no)
for (i in 1:impute_no){
cl <- makeCluster(4, type="SOCK")
features_RF[[i]] <- parSapply(cl, 1:5, select_vars, All_data[[i]], "year1", "RF", idenx_holdout, 2090)
stopCluster(cl)
}
A full factorial design. The following blocks of code were executed on the Ohio Supercomputer Center.
# here you can see how we set up the data preparation and feature selection steps from the earlier sections
# for each cell of the factorial design we can find its ID and then use the associated data and
# the features that were selected for it
data_design<-as.data.frame(matrix(0, ncol = 3, nrow = 8))
names(data_design)<-c("numerical_imputation","categorical_imputation","encoding")
data_design$numerical_imputation<-rep(rep(c("DROP", "MEDIAN"),each=2),2)
data_design$categorical_imputation<-rep(c("UNKNOWN", "DROP"),4)
data_design$encoding<-rep(c("LABEL", "HARD"),each=4)
data_design$ID<-rownames(data_design)
feature_selection<-c("FFS", "LASSO","RF")
data_design_features<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(data_design)*length(feature_selection)))
names(data_design_features)<-c("features")
data_design_features$features<-rep(c("FFS", "LASSO","RF"),each=nrow(data_design))
data_design_features<-cbind(data_design_features, data_design[rep(seq_len(nrow(data_design)), length(feature_selection)),])
resampling<-c("none", "down", "up", "rose", "smote")
DDFB<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(data_design_features)*length(resampling)))
names(DDFB)<-c("resampling")
DDFB$resampling<-rep(resampling,each=nrow(data_design_features))
DDFB<-cbind(DDFB, data_design_features[rep(seq_len(nrow(data_design_features)), length(resampling)),])
algorithm<-c("glm" , "svm","nnet" , "gbm" , "xgbDART", "rpart" , "ranger")
DDFBA<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(DDFB)*length(algorithm)))
names(DDFBA)<-c("algorithm")
DDFBA$algorithm<-rep(algorithm,each=nrow(DDFB))
DDFBA<-cbind(DDFBA, DDFB[rep(seq_len(nrow(DDFB)), length(algorithm)),])
# here folds refers to number of repeats in "repeated CV" approach
folds<-5
DDFBAF<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(DDFBA)*folds))
names(DDFBAF)<-c("fold")
DDFBAF$fold<-rep(c(1:folds),each=nrow(DDFBA))
DDFBAF<-cbind(DDFBAF, DDFBA[rep(seq_len(nrow(DDFBA)), folds),])
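The cascade of rep()/cbind() calls above enumerates the full factorial design step by step; for reference, base R's expand.grid() builds the same Cartesian product in one call (the column names below mirror the ones used above):

```r
design_grid <- expand.grid(
  numerical_imputation   = c("DROP", "MEDIAN"),
  categorical_imputation = c("UNKNOWN", "DROP"),
  encoding               = c("LABEL", "HARD"),
  features               = c("FFS", "LASSO", "RF"),
  resampling             = c("none", "down", "up", "rose", "smote"),
  algorithm              = c("glm", "svm", "nnet", "gbm", "xgbDART", "rpart", "ranger"),
  fold                   = 1:5,
  stringsAsFactors       = FALSE
)
nrow(design_grid)  # 2 * 2 * 2 * 3 * 5 * 7 * 5 = 4200 experimental runs
```

This matches the 2*2*2*3*5*7 design mentioned later, replicated over the 5 repeats.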
# These packages are needed for running over the supercomputer
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,Rmpi)
# design_xxx.csv has information of the full factorial design
# instead of glm below, the name of each algorithm should be substituted
# an example is: "design_lda.csv". The other algorithms' names are:
# (LR, ANN, SVM, CART, RF, GB and XGB)
# server_base.rds contains the following data:
# DDFBAF: it has data of the full factorial design which has 2*2*2*3*5*7 rows
# idenx_holdout: index of the holdout set
# features: features that attained after feature selection
# if the operating system is Windows, the code is running on a PC and reads from the local drive
# server_base is a large file that holds all data, all possible designs, the holdout index,
# and the features derived in the previous sections
if(Sys.info()[1]=="Windows"){
server_base<-readRDS(paste0(data.path,"server_base.rds"))
design_import<-read.csv(paste0(data.path,"design_glm.csv"))
}else{
server_base<-readRDS("/users/PMIU0138/miu0150/Transplant/data/server_base.rds")
design_import<-read.csv("/users/PMIU0138/miu0150/Transplant/data/design_glm.csv")
}
design<-list()
design$resampling<-as.character(design_import$resampling[!is.na(design_import$resampling)])
design$features<-as.character(design_import$features[!is.na(design_import$features)])
design$categorical_imputation<-as.character(design_import$categorical_imputation[!is.na(design_import$categorical_imputation)])
design$algorithm<-as.character(design_import$algorithm[!is.na(design_import$algorithm)])
design$Dname<-as.character(design_import$Dname[!is.na(design_import$Dname)])
DDFBAF<-server_base$DDFBAF
All_data<-server_base$All_data
idenx_holdout<-server_base$idenx_holdout
features<-server_base$features
# here is the screening design formation
screen_design<-DDFBAF[which( (DDFBAF$algorithm %in% design$algorithm) & (DDFBAF$resampling %in% design$resampling)
& (DDFBAF$features %in% design$features)
& (DDFBAF$categorical_imputation %in% design$categorical_imputation) ) ,]
# the main function is defined below so that each parallel worker does not need to fetch it from GitHub
###########main function###################################################
train_para<-function(iii, design_table0 ,All_data0,TARGET0="year1",
Index_test0=idenx_holdout, features0,seed0=2019){
gc()
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,parallel)
set.seed(seed0)
fold_no<-design_table0[iii,"fold"]
alg_fact<-design_table0[iii,"algorithm"]
resample_fact<-design_table0[iii,"resampling"]
feature_fact<-design_table0[iii,"features"]
impute_num_fact<-design_table0[iii,"numerical_imputation"]
impute_cat_fact<-design_table0[iii,"categorical_imputation"]
encoding_fact<-design_table0[iii,"encoding"]
data_fact<-as.numeric(design_table0[iii,"ID"])
# putting summary of the experiment in here
experiment.summary<-as.data.frame(matrix(0, ncol = 9, nrow = 1))
names(experiment.summary)<-c("data_scenario","algorithm","resampling","feature_selection",
"imputation_num","imputation_cat","encoding","fold","TARGET")
experiment.summary[1,]<-c(data_fact,alg_fact,resample_fact,feature_fact,
impute_num_fact,impute_cat_fact,encoding_fact,fold_no,TARGET0)
df <- All_data0[[data_fact]]
index_t<- which(df$ID%in%Index_test0[[fold_no]])
vars<-features0[feature_fact][[1]][[data_fact]][[fold_no]]
df <- df[,which(names(df)%in% c(vars,as.character(TARGET0[1])) )]
hold_out <- df[index_t,]
traindata <- df[-index_t,]
traindata$ID <- NULL
set.seed(seed0+fold_no)
# 1: survival; 0: death
traindata[,as.character(TARGET0)] <- as.factor(ifelse(traindata[,as.character(TARGET0)]=="0", "Death", "Survival"))
hold_out[,as.character(TARGET0)] <- as.factor(ifelse(hold_out[,as.character(TARGET0)]=="0", "Death", "Survival"))
formul<-as.formula(paste0(as.character(TARGET0[1]),"~."))
run.code<-"YES"
sampling.method<-resample_fact
if(resample_fact=="none"){sampling.method<-NULL}
# Reference for geometric mean
# Kim, Myoung-Jong, Dae-Ki Kang, and Hong Bae Kim. "Geometric mean based boosting
# algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction."
# Expert Systems with Applications 42.3 (2015): 1074-1082.
gmeanfunction <- function(data, lev = NULL, model = NULL) {
sub_sens<-caret::sensitivity(data$pred,data$obs)
sub_spec<-caret::specificity(data$pred,data$obs)
c(gmean = sqrt(sub_sens*sub_spec))
}
# I used 5 fold cross validation
control_setting <- caret::trainControl(method = "cv", number=5, sampling=sampling.method , verboseIter= TRUE,
#summaryFunction = twoClassSummary,
search="random", classProbs = TRUE, selectionFunction="best"
,summaryFunction = gmeanfunction)
#models:
# KPLS: kernelpls
# CART: rpart
# XGB: xgbDART
# random forest (RF): ranger
# logistic regression (LR): glm
if (alg_fact%in% c("glm", "nnet", "svmRadial","kernelpls")){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, family="binomial",
trControl = control_setting, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}else if (alg_fact%in%c("rf", "gbm", "earth", "rpart", "xgbTree", "naive_bayes","xgbDART" ,"ranger","glmnet")){
#if(alg_fact=="earth"){alg_fact<-"glmnet"}
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact,
trControl = control_setting, tuneLength=10, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}else if(alg_fact=="lda"){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, preProcess="pca", preProcOptions = list(method="BoxCox"),
trControl = control_setting, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}else if(alg_fact=="treebag"){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, family="binomial",
trControl = control_setting, tuneLength=10, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}
if(run.code=="YES"){
resul_raw <- as.data.frame(matrix(NA, ncol = 3, nrow = nrow(hold_out)))
colnames(resul_raw) <- c("TARGET", alg_fact, "Probability")
resul_raw$TARGET <- hold_out[as.character(TARGET0)]
train_raw <- as.data.frame(matrix(NA, ncol = 3, nrow = nrow(traindata)))
colnames(train_raw) <- c("TARGET", alg_fact, "Probability")
train_raw$TARGET <- traindata[as.character(TARGET0)]
resul_pred_perf<-as.data.frame(matrix(NA, ncol = 1, nrow = 4))
colnames(resul_pred_perf)<-c(alg_fact)
rownames(resul_pred_perf)<-c("auc","sen","spec","accu")
#if(class(try(varImp(result_model),silent = TRUE))!="try-error"){
train_raw$Probability <- predict(result_model, newdata=traindata, type="prob")[,2]
train_raw[alg_fact] <- predict(result_model, newdata=traindata, type="raw")
#train_auc <- AUC::auc(roc(train_raw$Probability, traindata[,TARGET]))
resul_raw[alg_fact] <- predict(result_model, newdata=hold_out, type="raw")
resul_raw$Probability <- predict(result_model, newdata=hold_out, type="prob")[,2]
resul_pred_perf[1,1] <- AUC::auc(roc(resul_raw$Probability,hold_out[,as.character(TARGET0)]))
resul_pred_perf[2,1] <- caret::sensitivity(resul_raw[,alg_fact],hold_out[,as.character(TARGET0)])
resul_pred_perf[3,1] <- caret::specificity(resul_raw[,alg_fact],hold_out[,as.character(TARGET0)])
resul_pred_perf[4,1] <- (as.data.frame(confusionMatrix(resul_raw[,alg_fact], hold_out[,as.character(TARGET0[1])])$overall))[1,]
#}
gmean_overal<-as.data.frame(result_model$results)
gmean_folds<-as.data.frame(result_model$resample)
equat<-as.data.frame(result_model$finalModel$coefficients)
names(equat)<-"coef"
equat$vars<-rownames(equat)
equat$vars[which(equat$vars==c("(Intercept)"))]<-c("INTERCEPT")
dir.create(paste0(getwd(),"/data_pile"))
rm(list=setdiff(ls(), c("experiment.summary","resul_pred_perf","resul_raw","alg_fact","iii","gmean_overal","gmean_folds","equat")))
save.image(paste0(getwd(),"/data_pile/",alg_fact,"-",iii,".RData"))
saveRDS(list(experiment.summary=experiment.summary, Performance=resul_pred_perf, Predicted=resul_raw,gmean_overal=gmean_overal,
gmean_folds=gmean_folds,equat=equat),
paste0(getwd(),"/data_pile/",alg_fact,"-",iii,".rds"))
return(list(experiment.summary=experiment.summary, Performance=resul_pred_perf, Predicted=resul_raw))
}else{
dir.create(paste0(getwd(),"/data_pile"))
rm(list=setdiff(ls(), c("experiment.summary","alg_fact","iii")))
save.image(paste0(getwd(),"/data_pile/NOT-",alg_fact,"-",iii,".RData"))
saveRDS(list(experiment.summary=experiment.summary, Performance="NOT", Predicted="NOT",gmean_overal="NOT",
gmean_folds="NOT",equat="NOT"),
paste0(getwd(),"/data_pile/NOT-",alg_fact,"-",iii,".rds"))
return(list(experiment.summary=experiment.summary, Performance="NOT", Predicted="NOT",gmean_overal="NOT",
gmean_folds="NOT",equat="NOT"))
}
}
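The gmeanfunction defined inside train_para() is caret's summaryFunction hook; as a standalone sketch, the metric itself is just the geometric mean of sensitivity and specificity. The toy labels below are illustrative (note that caret::sensitivity treats the first factor level as the positive class, so the study's sensitivity is computed with respect to "Death"):

```r
# G-mean: geometric mean of sensitivity and specificity, which penalizes
# classifiers that do well on one class at the expense of the other.
gmean <- function(pred, obs) {
  tab  <- table(pred, obs)
  sens <- tab["Survival", "Survival"] / sum(tab[, "Survival"])
  spec <- tab["Death", "Death"] / sum(tab[, "Death"])
  sqrt(sens * spec)
}
lev  <- c("Death", "Survival")
pred <- factor(c("Survival", "Survival", "Death", "Survival"), levels = lev)
obs  <- factor(c("Survival", "Death",    "Death", "Survival"), levels = lev)
gmean(pred, obs)  # sqrt(1 * 0.5) ~= 0.707
```

A classifier that predicts "Survival" for everyone would score sensitivity 1 but specificity 0, giving a G-mean of 0, which is exactly the behavior the metric is meant to punish on this imbalanced outcome.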
##########################################################################
#identifying number of created clusters for parallel processing
if(Sys.info()[1]=="Windows"){
library(snow)
cl <- parallel::makeCluster(4)
}else{
library(Rmpi)
library(snow)
# assign cores used in the parallel computing
workers <- as.numeric(Sys.getenv(c("PBS_NP"))) - 1
s.no<-nrow(screen_design)
workers.no<-min(workers,s.no)
cl <- makeCluster(workers.no,"MPI")
}
time.begin <- proc.time()[3]
results <- parSapply(cl, row.names(screen_design), train_para,design_table0=screen_design,
All_data0=All_data,TARGET0="year1",
Index_test0=idenx_holdout, features0=features,seed0=2019)
time.window <- proc.time()[3] - time.begin
stopCluster(cl)
rm(list=setdiff(ls(), c("results","design")))
# save.image(paste0("saver_glm.RData"))
saveRDS(results,paste0("/users/PMIU0138/miu0150/Transplant/data/_",design$Dname,".rds"))
mpi.quit()
After running the designs on the Ohio Supercomputer Center, the following block of code merges the performance results of all the scenarios into a single file.
All_data<-readRDS(paste0(data.path,"All_data.rds"))
features<-readRDS(paste0(data.path,"features.rds"))
# here you can see how we set up the data preparation and feature selection steps from the earlier sections
# for each cell of the factorial design we can find its ID and then use the associated data and
# the features that were selected for it
data_design<-as.data.frame(matrix(0, ncol = 3, nrow = 8))
names(data_design)<-c("numerical_imputation","categorical_imputation","encoding")
data_design$numerical_imputation<-rep(rep(c("DROP", "MEDIAN"),each=2),2)
data_design$categorical_imputation<-rep(c("UNKNOWN", "DROP"),4)
data_design$encoding<-rep(c("LABEL", "HARD"),each=4)
data_design$ID<-rownames(data_design)
feature_selection<-c("FFS", "LASSO","RF")
data_design_features<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(data_design)*length(feature_selection)))
names(data_design_features)<-c("features")
data_design_features$features<-rep(c("FFS", "LASSO","RF"),each=nrow(data_design))
data_design_features<-cbind(data_design_features, data_design[rep(seq_len(nrow(data_design)), length(feature_selection)),])
resampling<-c("none", "down", "up", "rose", "smote")
DDFB<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(data_design_features)*length(resampling)))
names(DDFB)<-c("resampling")
DDFB$resampling<-rep(resampling,each=nrow(data_design_features))
DDFB<-cbind(DDFB, data_design_features[rep(seq_len(nrow(data_design_features)), length(resampling)),])
# the following are the CARET names of the algorithms we used for this study.
algorithm<-c("glm" , "svm", "nnet" , "gbm" , "xgbDART", "rpart" , "ranger")
DDFBA<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(DDFB)*length(algorithm)))
names(DDFBA)<-c("algorithm")
DDFBA$algorithm<-rep(algorithm,each=nrow(DDFB))
DDFBA<-cbind(DDFBA, DDFB[rep(seq_len(nrow(DDFB)), length(algorithm)),])
# here folds refers to number of repeats in "repeated CV" approach
folds<-5
DDFBAF<-as.data.frame(matrix(0, ncol = 1, nrow = nrow(DDFBA)*folds))
names(DDFBAF)<-c("fold")
DDFBAF$fold<-rep(c(1:folds),each=nrow(DDFBA))
DDFBAF<-cbind(DDFBAF, DDFBA[rep(seq_len(nrow(DDFBA)), folds),])
server_base<-readRDS(paste0(data.path,"server_base.rds"))
DDFBAF<-server_base$DDFBAF
DDFBAF$auc<-NA
DDFBAF$sen<-NA
DDFBAF$spec <-NA
DDFBAF$accu<-NA
DDFBAF$data_scenario<-DDFBAF$ID
DDFBAF$ID<-row.names(DDFBAF)
DDFBAF$imputation_cat<-DDFBAF$categorical_imputation
DDFBAF$imputation_num<-DDFBAF$numerical_imputation
DDFBAF$TARGET<-"year1"
DDFBAF$feature_selection<-DDFBAF$features
DDFBAF[c("categorical_imputation","numerical_imputation","features")]<-NULL
# There are two ways to produce a report that has all the results
# First way: use this if you already have a large "*.rds" file that holds the results of all scenarios together
# instead of rpart below, the name of each algorithm should be substituted
# an example is: "glm.rds". The other algorithms' names are:
# "lda","glmnet", "nnet" ,"rpart" (CART),"xgbDART" (XGB), and "ranger" (RF)
rpart<-readRDS(paste0(data.path,"_full_naive_bayes.rds"))
rpart.summary<-as.data.frame(matrix(0, ncol = 14, nrow = ncol(rpart)))
names(rpart.summary)<-c("data_scenario","algorithm","resampling","feature_selection",
"imputation_num","imputation_cat","encoding","fold","TARGET",
"auc","sen","spec","accu","ID")
for(i in 1:ncol(rpart)){
rpart.summary[i,1:9]<-as.character(rpart[[1,i]])
rpart.summary[i,10:13]<-as.numeric(t(rpart[[2,i]]))
rpart.summary[i,14]<-colnames(rpart)[i]
}
rpart.summary<-rpart.summary[complete.cases(rpart.summary),]
rpart.summary$gmean<-sqrt(rpart.summary$sen*rpart.summary$spec)
write.csv(rpart.summary,"rpart.summary.csv",row.names = FALSE)
# Second way: use this if the results of each scenario are reported separately
# instead of rpart below, the name of each algorithm should be substituted
# an example is: "glm.rds". The other algorithms' names are:
# "lda","glmnet", "nnet" ,"rpart" (CART),"xgbDART" (XGB), and "ranger" (RF)
for(i in row.names(DDFBAF)){
if((class(try(
data.temp<-readRDS(paste0(data.path,"/all_scenarios/",DDFBAF[i,"algorithm"],"-",i,".rds")),silent = TRUE))!="try-error")){
data.temp<-as.data.frame(t(data.temp$Performance))
DDFBAF[i,"auc"]<-data.temp$auc
DDFBAF[i,"sen"]<-data.temp$sen
DDFBAF[i,"spec"] <-data.temp$spec
DDFBAF[i,"accu"]<-data.temp$accu
}
}
rpart<-DDFBAF[which(DDFBAF$algorithm=="rpart" & !is.na(DDFBAF$auc)),]
rpart$gmean<-sqrt(rpart$sen*rpart$spec)
write.csv(rpart,"rpart.summary.csv",row.names = FALSE)
The dataset is fairly large, and the holdout selection is performed randomly. The best model was the same under both repeated CV and plain CV. In all of the scenarios, the training models are tuned to the highest G-mean; however, the other outcomes of the scenarios are also investigated, and among them the scenarios with the highest “accuracy”, “sensitivity”, “specificity”, and “AUC” are reported.
# Here a summary of all the scenarios is provided
glm_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/glm_res.csv")
nnet_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/nnet_res.csv")
ranger_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/ranger_res.csv")
xgbDART_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/xgbDART_res.csv")
rpart_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/rpart_res.csv")
# showing performance of screening algorithms
screen_glm_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/_screen_glm_results_gmean.csv")
screen_ranger_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/_screen_ranger_results_gmean.csv")
DF.names<-c("auc" , "sen" , "spec", "accu" , "data_scenario","algorithm","resampling" , "feature_selection", "imputation_num","imputation_cat" ,"encoding","fold" ,"TARGET")
names(screen_glm_res)<-DF.names
names(screen_ranger_res)<-DF.names
screen_glm_res$gmean<-sqrt(screen_glm_res$sen*screen_glm_res$spec)
screen_ranger_res$gmean<-sqrt(screen_ranger_res$sen*screen_ranger_res$spec)
pvt_glm_mean<-pvt_maker(data_res=glm_res,functio_pvt=c("mean"))
pvt_nnet_mean<-pvt_maker(data_res=nnet_res,functio_pvt=c("mean"))
pvt_ranger_mean<-pvt_maker(data_res=ranger_res,functio_pvt=c("mean"))
pvt_xgbDART_mean<-pvt_maker(data_res=xgbDART_res,functio_pvt=c("mean"))
pvt_rpart_mean<-pvt_maker(data_res=rpart_res,functio_pvt=c("mean"))
###########################Full Design step############################################################
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
library(pacman)
p_load(DoE.base,FrF2,DoE.wrapper,RcmdrPlugin.Export,
RcmdrPlugin.FactoMineR,RcmdrPlugin.HH
,RcmdrPlugin.TeachingDemos,RcmdrPlugin.orloca,Rcmdr,conf.design
,lhs,AlgDesign,DiceDesign,rsm,RcmdrPlugin.DoE,pivottabler,data.table,DT , htmltools, nortest)
# excluding total rows for full factorial design
DT_glm <- data.table(rownames(pvt_glm_mean), result=grepl("Total", rownames(pvt_glm_mean)))
DT_glm<-as.data.frame(DT_glm)
pvt_glm_mean<-pvt_glm_mean[which(DT_glm$result==FALSE),]
DT_nnet <- data.table(rownames(pvt_nnet_mean), result=grepl("Total", rownames(pvt_nnet_mean)))
DT_nnet<-as.data.frame(DT_nnet)
pvt_nnet_mean<-pvt_nnet_mean[which(DT_nnet$result==FALSE),]
DT_ranger <- data.table(rownames(pvt_ranger_mean), result=grepl("Total", rownames(pvt_ranger_mean)))
DT_ranger<-as.data.frame(DT_ranger)
pvt_ranger_mean<-pvt_ranger_mean[which(DT_ranger$result==FALSE),]
DT_xgbDART <- data.table(rownames(pvt_xgbDART_mean), result=grepl("Total", rownames(pvt_xgbDART_mean)))
DT_xgbDART<-as.data.frame(DT_xgbDART)
pvt_xgbDART_mean<-pvt_xgbDART_mean[which(DT_xgbDART$result==FALSE),]
DT_rpart <- data.table(rownames(pvt_rpart_mean), result=grepl("Total", rownames(pvt_rpart_mean)))
DT_rpart<-as.data.frame(DT_rpart)
pvt_rpart_mean<-pvt_rpart_mean[which(DT_rpart$result==FALSE),]
model_best <- as.data.frame(matrix(NA, ncol = 10, nrow = 5))
colnames(model_best) <- c("encoding","feature_selection","Num_treat","Cat_treat","resampling",
"auc_mean","sen_mean","spec_mean","accu_mean","gmean_mean")
row.names(model_best)<-c("best_accuracy","best_sensitivity","best_specificity","best_AUC","best_GMEAN")
glm_best<-model_best
ranger_best<-model_best
nnet_best<-model_best
xgbDART_best<-model_best
rpart_best<-model_best
#####
glm_best[1,1:5]<-unlist(strsplit(rownames(pvt_glm_mean[which(pvt_glm_mean$accu_mean==max(pvt_glm_mean$accu_mean)),])," "))
glm_best[1,6:10]<-pvt_glm_mean[which(pvt_glm_mean$accu_mean==max(pvt_glm_mean$accu_mean)),]
glm_best[2,1:5]<-unlist(strsplit(rownames(pvt_glm_mean[which(pvt_glm_mean$sen_mean==max(pvt_glm_mean$sen_mean)),])," "))
glm_best[2,6:10]<-pvt_glm_mean[which(pvt_glm_mean$sen_mean==max(pvt_glm_mean$sen_mean)),]
glm_best[3,1:5]<-unlist(strsplit(rownames(pvt_glm_mean[which(pvt_glm_mean$spec_mean==max(pvt_glm_mean$spec_mean)),])," "))
glm_best[3,6:10]<-pvt_glm_mean[which(pvt_glm_mean$spec_mean==max(pvt_glm_mean$spec_mean)),]
glm_best[4,1:5]<-unlist(strsplit(rownames(pvt_glm_mean[which(pvt_glm_mean$auc_mean==max(pvt_glm_mean$auc_mean)),])," "))
glm_best[4,6:10]<-pvt_glm_mean[which(pvt_glm_mean$auc_mean==max(pvt_glm_mean$auc_mean)),]
glm_best[5,1:5]<-unlist(strsplit(rownames(pvt_glm_mean[which(pvt_glm_mean$gmean_mean==max(pvt_glm_mean$gmean_mean)),])," "))
glm_best[5,6:10]<-pvt_glm_mean[which(pvt_glm_mean$gmean_mean==max(pvt_glm_mean$gmean_mean)),]
nnet_best[1,1:5]<-unlist(strsplit(rownames(pvt_nnet_mean[which(pvt_nnet_mean$accu_mean==max(pvt_nnet_mean$accu_mean)),])," "))
nnet_best[1,6:10]<-pvt_nnet_mean[which(pvt_nnet_mean$accu_mean==max(pvt_nnet_mean$accu_mean)),]
nnet_best[2,1:5]<-unlist(strsplit(rownames(pvt_nnet_mean[which(pvt_nnet_mean$sen_mean==max(pvt_nnet_mean$sen_mean)),])," "))
nnet_best[2,6:10]<-pvt_nnet_mean[which(pvt_nnet_mean$sen_mean==max(pvt_nnet_mean$sen_mean)),]
nnet_best[3,1:5]<-unlist(strsplit(rownames(pvt_nnet_mean[which(pvt_nnet_mean$spec_mean==max(pvt_nnet_mean$spec_mean)),])," "))
nnet_best[3,6:10]<-pvt_nnet_mean[which(pvt_nnet_mean$spec_mean==max(pvt_nnet_mean$spec_mean)),]
nnet_best[4,1:5]<-unlist(strsplit(rownames(pvt_nnet_mean[which(pvt_nnet_mean$auc_mean==max(pvt_nnet_mean$auc_mean)),])," "))
nnet_best[4,6:10]<-pvt_nnet_mean[which(pvt_nnet_mean$auc_mean==max(pvt_nnet_mean$auc_mean)),]
nnet_best[5,1:5]<-unlist(strsplit(rownames(pvt_nnet_mean[which(pvt_nnet_mean$gmean_mean==max(pvt_nnet_mean$gmean_mean)),])," "))
nnet_best[5,6:10]<-pvt_nnet_mean[which(pvt_nnet_mean$gmean_mean==max(pvt_nnet_mean$gmean_mean)),]
ranger_best[1,1:5]<-unlist(strsplit(rownames(pvt_ranger_mean[which(pvt_ranger_mean$accu_mean==max(pvt_ranger_mean$accu_mean)),])," "))
ranger_best[1,6:10]<-pvt_ranger_mean[which(pvt_ranger_mean$accu_mean==max(pvt_ranger_mean$accu_mean)),]
ranger_best[2,1:5]<-unlist(strsplit(rownames(pvt_ranger_mean[which(pvt_ranger_mean$sen_mean==max(pvt_ranger_mean$sen_mean)),])," "))
ranger_best[2,6:10]<-pvt_ranger_mean[which(pvt_ranger_mean$sen_mean==max(pvt_ranger_mean$sen_mean)),]
ranger_best[3,1:5]<-unlist(strsplit(rownames(pvt_ranger_mean[which(pvt_ranger_mean$spec_mean==max(pvt_ranger_mean$spec_mean)),])," "))
ranger_best[3,6:10]<-pvt_ranger_mean[which(pvt_ranger_mean$spec_mean==max(pvt_ranger_mean$spec_mean)),]
ranger_best[4,1:5]<-unlist(strsplit(rownames(pvt_ranger_mean[which(pvt_ranger_mean$auc_mean==max(pvt_ranger_mean$auc_mean)),])," "))
ranger_best[4,6:10]<-pvt_ranger_mean[which(pvt_ranger_mean$auc_mean==max(pvt_ranger_mean$auc_mean)),]
ranger_best[5,1:5]<-unlist(strsplit(rownames(pvt_ranger_mean[which(pvt_ranger_mean$gmean_mean==max(pvt_ranger_mean$gmean_mean)),])," "))
ranger_best[5,6:10]<-pvt_ranger_mean[which(pvt_ranger_mean$gmean_mean==max(pvt_ranger_mean$gmean_mean)),]
xgbDART_best[1,1:5]<-unlist(strsplit(rownames(pvt_xgbDART_mean[which(pvt_xgbDART_mean$accu_mean==max(pvt_xgbDART_mean$accu_mean)),])," "))
xgbDART_best[1,6:10]<-pvt_xgbDART_mean[which(pvt_xgbDART_mean$accu_mean==max(pvt_xgbDART_mean$accu_mean)),]
xgbDART_best[2,1:5]<-unlist(strsplit(rownames(pvt_xgbDART_mean[which(pvt_xgbDART_mean$sen_mean==max(pvt_xgbDART_mean$sen_mean)),])," "))
xgbDART_best[2,6:10]<-pvt_xgbDART_mean[which(pvt_xgbDART_mean$sen_mean==max(pvt_xgbDART_mean$sen_mean)),]
xgbDART_best[3,1:5]<-unlist(strsplit(rownames(pvt_xgbDART_mean[which(pvt_xgbDART_mean$spec_mean==max(pvt_xgbDART_mean$spec_mean)),])," "))
xgbDART_best[3,6:10]<-pvt_xgbDART_mean[which(pvt_xgbDART_mean$spec_mean==max(pvt_xgbDART_mean$spec_mean)),]
xgbDART_best[4,1:5]<-unlist(strsplit(rownames(pvt_xgbDART_mean[which(pvt_xgbDART_mean$auc_mean==max(pvt_xgbDART_mean$auc_mean)),])," "))
xgbDART_best[4,6:10]<-pvt_xgbDART_mean[which(pvt_xgbDART_mean$auc_mean==max(pvt_xgbDART_mean$auc_mean)),]
xgbDART_best[5,1:5]<-unlist(strsplit(rownames(pvt_xgbDART_mean[which(pvt_xgbDART_mean$gmean_mean==max(pvt_xgbDART_mean$gmean_mean)),])," "))
xgbDART_best[5,6:10]<-pvt_xgbDART_mean[which(pvt_xgbDART_mean$gmean_mean==max(pvt_xgbDART_mean$gmean_mean)),]
rpart_best[1,1:5]<-unlist(strsplit(rownames(pvt_rpart_mean[which(pvt_rpart_mean$accu_mean==max(pvt_rpart_mean$accu_mean)),])," "))
rpart_best[1,6:10]<-pvt_rpart_mean[which(pvt_rpart_mean$accu_mean==max(pvt_rpart_mean$accu_mean)),]
rpart_best[2,1:5]<-unlist(strsplit(rownames(pvt_rpart_mean[which(pvt_rpart_mean$sen_mean==max(pvt_rpart_mean$sen_mean)),])," "))
rpart_best[2,6:10]<-pvt_rpart_mean[which(pvt_rpart_mean$sen_mean==max(pvt_rpart_mean$sen_mean)),]
rpart_best[3,1:5]<-unlist(strsplit(rownames(pvt_rpart_mean[which(pvt_rpart_mean$spec_mean==max(pvt_rpart_mean$spec_mean)),])," "))
rpart_best[3,6:10]<-pvt_rpart_mean[which(pvt_rpart_mean$spec_mean==max(pvt_rpart_mean$spec_mean)),]
rpart_best[4,1:5]<-unlist(strsplit(rownames(pvt_rpart_mean[which(pvt_rpart_mean$auc_mean==max(pvt_rpart_mean$auc_mean)),])," "))
rpart_best[4,6:10]<-pvt_rpart_mean[which(pvt_rpart_mean$auc_mean==max(pvt_rpart_mean$auc_mean)),]
rpart_best[5,1:5]<-unlist(strsplit(rownames(pvt_rpart_mean[which(pvt_rpart_mean$gmean_mean==max(pvt_rpart_mean$gmean_mean)),])," "))
rpart_best[5,6:10]<-pvt_rpart_mean[which(pvt_rpart_mean$gmean_mean==max(pvt_rpart_mean$gmean_mean)),]
cat("##########Full Factorial Design#################")
# only the *_best tables built above are combined, so the rows match the algorithm labels
MEAN_all_GMEAN<-as.data.frame(rbind( glm_best["best_GMEAN",], nnet_best["best_GMEAN",],
ranger_best["best_GMEAN",], xgbDART_best["best_GMEAN",], rpart_best["best_GMEAN",] ))
MEAN_all_GMEAN$algorithms<-c("glm","nnet", "ranger","xgbDART","rpart")
row.names(MEAN_all_GMEAN)<-NULL
# average performance of algorithms in GMEAN
cat("Peformance of the Best Combinations in GMEAN:")
DT::datatable(MEAN_all_GMEAN)
Here the performance of the algorithms is compared. Since a majority of the “svm” and “GB” runs had not converged to a result after 24 hours of execution on the supercomputer, they are not presented here.
full_fact_res<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/best-gmean-repeat-CV.csv")
DT::datatable(full_fact_res)
Now, the best scenario is: logistic regression, one-hot encoding, LASSO feature selection, median imputation for the numerical variables, the “UNKNOWN” label for the categorical variables’ imputation, and up-sampling. This combination is used for the rest of the years. In Section 4, holdout sets are created for all timestamps, and performance before and after isotonic regression is compared.
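As a sketch of the second-stage calibration idea (the raw probabilities below are made up for illustration; the study's exact procedure is described in the paper), stats::isoreg() fits a monotonically non-decreasing step function, so negating the inputs yields non-increasing survival probabilities over time:

```r
years     <- 0:10
# Hypothetical raw year-by-year survival probabilities from stage 1;
# note the non-monotonic bumps at years 2, 4, and 7.
raw_probs <- c(0.95, 0.90, 0.91, 0.85, 0.86, 0.80, 0.78, 0.79, 0.72, 0.70, 0.66)
fit        <- isoreg(years, -raw_probs)  # increasing fit on the negated series
mono_probs <- -fit$yf                    # back-transform: non-increasing series
stopifnot(all(diff(mono_probs) <= 0))    # monotonicity constraint satisfied
```

Isotonic regression pools adjacent violators, so the bumpy years are averaged with their neighbors while the already-monotonic stretches are left unchanged.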
data.path<-"C:/Users/hahadydolatsara/OneDrive - Clark University/Research topics/Transplantation/two-stage/data/"
# We dropped the numerical NAs and replaced the NA (missing values) in categorical variables to "UNKNOWN"
heart.df.cleaned_all<-readRDS(paste0(data.path,"heart.df.cleaned_all.rds"))
pool_char_clean_all<-readRDS(paste0(data.path,"pool_char_clean_all.rds"))
pool_num_clean_all<-readRDS(paste0(data.path,"pool_num_clean_all.rds"))
# The unavailable data in numerical variables are imputed by median
heart.df.cleaned_all_num <- heart.df.cleaned_all[pool_num_clean_all]
num_med<-as.data.frame(apply(heart.df.cleaned_all_num,2,function(x) median(x,na.rm = TRUE) ))
for(j in pool_num_clean_all){
heart.df.cleaned_all_num[is.na(heart.df.cleaned_all_num[j]),j]<-num_med[j,1]
}
# imputing the unavailable data for categorical variables by "UNKNOWN"
heart.df.cleaned_all_char <- heart.df.cleaned_all[pool_char_clean_all]
heart.df.cleaned_all_char[is.na(heart.df.cleaned_all_char)] <- "UNKNOWN"
heart.df.cleaned_all<- cbind(heart.df.cleaned_all_num,
heart.df.cleaned_all_char,
heart.df.cleaned_all[c("ID")])
pool_char_clean <- c(pool_char_clean,paste("year", seq(0,10), sep=""))
# remove unnecessary datasets
rm(list=c("heart.df.cleaned_all_char","heart.df.cleaned_all_num"))
#Make sure the type of each variable is recorded correctly
for(i in names(heart.df.cleaned_all)){
if(i %in% pool_char_clean_all){heart.df.cleaned_all[i] <- as.character(heart.df.cleaned_all[,i])}
if(i %in% pool_num_clean_all){heart.df.cleaned_all[i] <- as.numeric(heart.df.cleaned_all[,i])}
}
In this snippet, we create the training and holdout datasets at each time point of interest. To both validate the models on the holdout datasets and use isotonic regression to calibrate the survival probabilities, we ensure that all patients selected for the 10th-year holdout dataset are also included in the holdout sets at the earlier time points. Note that this procedure is not strictly necessary for either prediction or calibration of the survival probabilities.
We use a list object keep_NA to record the following IDs:
1. IDs for the dataset at each time point of interest
2. IDs for the holdout dataset at each time point of interest
3. IDs for the training dataset at each time point of interest
data.path<-"C:/Users/hahadydolatsara/OneDrive - Clark University/Research topics/Transplantation/two-stage/data/"
# We keep IDs
keep_NA <- rep(list(NA), 33)
for (i in 1:11){
keep_NA[[i]] <- heart.df.cleaned_all[!is.na(heart.df.cleaned_all[paste0("year", (i-1))]),"ID"]
}
names(keep_NA)[1:11] <- paste0("ID",seq(0,10))
all_sizes <- unname(unlist(lapply(keep_NA, length)))
test_sizes <- round(all_sizes*0.2)
train_sizes <- all_sizes - test_sizes
# We select the holdout sets that contain the year 10 test data
# Then we exclude those IDs from each year to find IDs for other training sets
set.seed(2019)
keep_NA[[12]] <- sample(keep_NA$ID10,test_sizes[11])
names(keep_NA)[12] <- "ID_holdout10"
for (i in 2:11){
keep_NA[[(11+i)]] <- c(keep_NA$ID_holdout10, sample(setdiff(keep_NA[[(12-i)]], keep_NA$ID_holdout10), (test_sizes[(12-i)]-length(keep_NA$ID_holdout10))))
}
names(keep_NA)[13:22] <- paste0("ID_holdout", seq(9,0))
for (i in 1:11){
keep_NA[[(22+i)]] <- setdiff(keep_NA[[i]], keep_NA[[paste0("ID_holdout",(i-1))]])
}
names(keep_NA)[23:33] <- paste0("ID_train", seq(0,10))
# One-hot encoding is applied to the independent categorical variables using the dummy_maker() function.
exclud <- c(paste0("year",seq(0,10)), "ID")
var_ind_char <- pool_char_clean[!pool_char_clean %in% exclud]
heart.df.cleaned_all.dum <- dummy_maker(heart.df.cleaned_all,var_ind_char)
saveRDS(heart.df.cleaned_all.dum,paste0(data.path,"heart.df.cleaned_all.dum.rds"))
saveRDS(keep_NA,paste0(data.path,"keep_NA.rds"))
# preparing the data for running feature selection on the supercomputer
server_data<-list()
exclud <- c(paste0("year",seq(0,10)), "ID")
for (i in 1:11){
temp<-heart.df.cleaned_all.dum[heart.df.cleaned_all.dum$ID %in% keep_NA[[paste0("ID_train", i-1)]],]
temp2<- temp[!names(temp) %in% exclud]
temp2[[paste0("year",i-1)]]<-temp[[paste0("year",i-1)]]
server_data[[i]]<-temp2
}
rm(list=c("temp","temp2"))
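The dummy_maker() function is the study's own helper (available in the GitHub repository). As a rough, self-contained sketch of the same idea, one-hot encoding can be reproduced with caret::dummyVars; the small data frame and the variable values below are purely illustrative:

```r
# Illustrative sketch only: dummy_maker() is the repo's own helper.
# caret::dummyVars produces comparable one-hot indicator columns.
library(caret)
toy <- data.frame(GENDER = c("M", "F", "M"),
                  ABO    = c("A", "O", "B"),
                  AGE    = c(54, 61, 47),
                  stringsAsFactors = TRUE)
# fullRank = FALSE keeps one indicator column per factor level
dv <- dummyVars(~ GENDER + ABO, data = toy, fullRank = FALSE)
toy_dum <- cbind(as.data.frame(predict(dv, newdata = toy)), toy["AGE"])
# toy_dum now has 2 GENDER columns, 3 ABO columns, and the untouched AGE
```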
The following code was run on the supercomputer to perform feature selection with LASSO, using the LASSO_features() function.
data.path<-"C:/Users/hahadydolatsara/OneDrive - Clark University/Research topics/Transplantation/two-stage/data/"
#gc()
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,pls,Rmpi)
# preparing the data for running feature selection on the supercomputer
heart.df.cleaned_all.dum<-readRDS("/users/PMIU0138/miu0150/Transplant/data/heart.df.cleaned_all.dum.rds")
keep_NA<-readRDS("/users/PMIU0138/miu0150/Transplant/data/keep_NA.rds")
server_data<-list()
exclud <- c(paste0("year",seq(0,10)), "ID")
for (i in 1:11){
temp<-heart.df.cleaned_all.dum[heart.df.cleaned_all.dum$ID %in% keep_NA[[paste0("ID_train", i-1)]],]
temp2<- temp[!names(temp) %in% exclud]
temp2[[paste0("year",i-1)]]<-temp[[paste0("year",i-1)]]
server_data[[i]]<-temp2
}
rm(list=c("temp","temp2"))
#identifying number of created clusters for parallel processing
if(Sys.info()[1]=="Windows"){
library(snow)
cl <- parallel::makeCluster(4)
}else{
library(Rmpi)
library(snow)
# assign cores used in the parallel computing
workers <- as.numeric(Sys.getenv(c("PBS_NP"))) - 1
s.no <- 11 # one parallel task per time point (years 0-10)
workers.no<-min(workers,s.no)
cl <- makeCluster(workers.no,"MPI")
}
time.begin <- proc.time()[3]
LASSO <- parSapply(cl, 1:11, LASSO_features,data=server_data,
folds=5,trace=F, alpha=1,seed=110)
time.window <- proc.time()[3] - time.begin
stopCluster(cl)
saveRDS(LASSO,paste0("/users/PMIU0138/miu0150/Transplant/data/_Features_LASSO.rds"))
mpi.quit()
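LASSO_features() is the study's own wrapper (available in the GitHub repository). As a hedged, minimal sketch of the underlying idea, glmnet::cv.glmnet can fit a cross-validated LASSO (alpha = 1) on a binary outcome and retain the predictors with nonzero coefficients at lambda.min; the simulated data below are purely illustrative:

```r
# Minimal LASSO feature-selection sketch (simulated data, not the UNOS data)
library(glmnet)
set.seed(110)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
# the outcome truly depends only on V1 and V2
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)
coefs <- coef(cvfit, s = "lambda.min")
selected <- setdiff(rownames(coefs)[which(coefs[, 1] != 0)], "(Intercept)")
# 'selected' should contain V1 and V2, possibly plus a few noise variables
```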
The following code was run on the supercomputer to train the logistic regression models.
if(Sys.info()[1]=="Windows"){
data.path<-"C:/Users/hahadydolatsara/OneDrive - Clark University/Research topics/Transplantation/two-stage/data/"
# LASSO has all the variables selected by LASSO in all the years
LASSO<-readRDS(paste0(data.path,"_Features_LASSO.rds"))
# keep_NA has IDs of train set and holdout sets in all the time points
keep_NA<-readRDS(paste0(data.path,"keep_NA.rds"))
heart.df.cleaned_all.dum<-readRDS(paste0(data.path,"heart.df.cleaned_all.dum.rds"))
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,pls)
}
if(Sys.info()[1]=="Linux"){
# preparing the data for running feature selection on the supercomputer
LASSO<-readRDS("/users/PMIU0138/miu0150/Transplant/data/_Features_LASSO.rds")
heart.df.cleaned_all.dum<-readRDS("/users/PMIU0138/miu0150/Transplant/data/heart.df.cleaned_all.dum.rds")
keep_NA<-readRDS("/users/PMIU0138/miu0150/Transplant/data/keep_NA.rds")
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,pls,Rmpi)
}
train_data<-list()
holdOut_data<-list()
exclud <- c(paste0("year",seq(0,10)), "ID")
exclud_h <- c(paste0("year",seq(0,10)))
for (i in 1:11){
temp<-heart.df.cleaned_all.dum[heart.df.cleaned_all.dum$ID %in% keep_NA[[paste0("ID_train", i-1)]],]
temp2<- temp[!names(temp) %in% exclud]
temp2[[paste0("year",i-1)]]<-temp[[paste0("year",i-1)]]
train_data[[i]]<-temp2
temp0<-heart.df.cleaned_all.dum[heart.df.cleaned_all.dum$ID %in% keep_NA[[paste0("ID_holdout", i-1)]],]
temp02<- temp0[!names(temp0) %in% exclud_h]
temp02[[paste0("year",i-1)]]<-temp0[[paste0("year",i-1)]]
holdOut_data[[i]]<-temp02
}
# saveRDS(holdOut_data,paste0(data.path,"_holdOut_data.rds"))
rm(list=c("temp","temp2","temp0","temp02"))
#gc()
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
###########main function###################################################
train_para2<-function(iii, t_data,h_data,features,folds=5,resampl_meth="up",alg_used,seed0=2019){
gc()
if(!"pacman" %in% rownames(installed.packages())){
install.packages(pkgs = "pacman",repos = "http://cran.us.r-project.org")
}
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(caret,AUC,MASS,ROSE,DMwR,snow,ranger,parallel,xgboost,gbm,naivebayes,e1071,kernlab,pls,parallel)
set.seed(seed0)
fold_no<-folds
alg_fact<-alg_used
resample_fact<-resampl_meth
coef<-"NO"
var_imp<-"NO"
df <- t_data[[iii]]
hold_out_ <- h_data[[iii]]
traindata <- df[c(as.character(features[[paste0("year",iii-1)]][,]),paste0("year",iii-1) )]
hold_out<-hold_out_[c(as.character(features[[paste0("year",iii-1)]][,]),paste0("year",iii-1),"ID" )]
TARGET0<-paste0("year",iii-1)
# 1: survival; 0: death
traindata[,as.character(TARGET0)] <- as.factor(ifelse(traindata[,as.character(TARGET0)]=="0", "Death", "Survival"))
hold_out[,as.character(TARGET0)] <- as.factor(ifelse(hold_out[,as.character(TARGET0)]=="0", "Death", "Survival"))
formul<-as.formula(paste0(as.character(TARGET0),"~."))
run.code<-"YES"
sampling.method<-resample_fact
if(resample_fact=="none"){sampling.method<-NULL}
# Reference for geometric mean
# Kim, Myoung-Jong, Dae-Ki Kang, and Hong Bae Kim. "Geometric mean based boosting
# algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction."
# Expert Systems with Applications 42.3 (2015): 1074-1082.
gmeanfunction <- function(data, lev = NULL, model = NULL) {
sub_sens<-caret::sensitivity(data$pred,data$obs)
sub_spec<-caret::specificity(data$pred,data$obs)
c(gmean = sqrt(sub_sens*sub_spec))
}
# We train the models with 5-fold cross-validation
control_setting <- caret::trainControl(method = "cv", number=fold_no, sampling=sampling.method ,
#summaryFunction = twoClassSummary,
# the next one is for saving the predictions
savePredictions = "all",
search="random", classProbs = TRUE, selectionFunction="best"
,summaryFunction = gmeanfunction)
if (alg_fact%in% c("glm", "nnet", "svmRadial")){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, family="binomial",
trControl = control_setting, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}else if (alg_fact%in%c("rf", "gbm", "earth", "rpart", "xgbTree", "naive_bayes","xgbDART" ,"ranger","glmnet")){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact,
trControl = control_setting, tuneLength=10, metric="gmean"),silent = TRUE))=="try-error")[1]){ run.code<-"NO"}
}else if(alg_fact=="lda"){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, preProcess="pca", preProcOptions = list(method="BoxCox"),
trControl = control_setting, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}else if(alg_fact=="treebag"){
if((class(try( result_model <- train(formul, data=traindata, method=alg_fact, family="binomial",
trControl = control_setting, tuneLength=10, metric="gmean"),silent = TRUE))=="try-error")[1]){run.code<-"NO"}
}
if(run.code=="YES"){
coef<-as.data.frame(summary(result_model)$coefficients)
coef$vars<-rownames(coef)
coef$vars<-as.character(sapply(coef$vars,function(x) gsub("1","",x)))
coef$vars[which(coef$vars=="(Intercept)")]<-"INTERCEPT"
coef$vars[match("DAYS_STAT1",rownames(coef))]<-"DAYS_STAT1"
coef$vars[match("DAYS_STAT1A",rownames(coef))]<-"DAYS_STAT1A"
coef$vars[match("DAYS_STAT1B",rownames(coef))]<-"DAYS_STAT1B"
if(alg_fact=="glm"){
var_imp<-as.data.frame(caret::varImp(result_model)$importance)
var_imp$vars<-rownames(var_imp)
var_imp$vars<-as.character(sapply(var_imp$vars,function(x) gsub("1","",x)))
var_imp$vars[match("DAYS_STAT1",rownames(var_imp))]<-"DAYS_STAT1"
var_imp$vars[match("DAYS_STAT1A",rownames(var_imp))]<-"DAYS_STAT1A"
var_imp$vars[match("DAYS_STAT1B",rownames(var_imp))]<-"DAYS_STAT1B"
}
resul_raw <- as.data.frame(matrix(NA, ncol = 4, nrow = nrow(hold_out)))
colnames(resul_raw) <- c("TARGET_raw", alg_fact, "Probability","ID")
resul_raw$TARGET_raw <- hold_out[as.character(TARGET0)]
resul_raw$ID<-hold_out$ID
#train_raw <- as.data.frame(matrix(NA, ncol = 3, nrow = nrow(traindata)))
#colnames(train_raw) <- c("TARGET", alg_fact, "Probability")
#train_raw$TARGET <- traindata[as.character(TARGET0)]
resul_pred_perf<-as.data.frame(matrix(NA, ncol = 1, nrow = 5))
colnames(resul_pred_perf)<-c(alg_fact)
rownames(resul_pred_perf)<-c("auc","sen","spec","accu","gmean")
resul_pred_perf_folds<-as.data.frame(matrix(NA, ncol = fold_no, nrow = 5))
colnames(resul_pred_perf_folds)<-paste0("fold",1:fold_no)
rownames(resul_pred_perf_folds)<-c("auc","sen","spec","accu","gmean")
fold_res<-list()
for (j in 1:fold_no) {
fold_res[[j]]<-result_model$pred[which(result_model$pred$Resample==paste0("Fold",j)),]
resul_pred_perf_folds["auc",j]<-AUC::auc(roc(fold_res[[j]]$Survival,fold_res[[j]]$obs))
resul_pred_perf_folds["sen",j]<-caret::sensitivity(fold_res[[j]]$pred,fold_res[[j]]$obs)
resul_pred_perf_folds["spec",j]<-caret::specificity(fold_res[[j]]$pred,fold_res[[j]]$obs)
resul_pred_perf_folds["accu",j]<-(as.data.frame(confusionMatrix(fold_res[[j]]$pred, fold_res[[j]]$obs)$overall))[1,]
resul_pred_perf_folds["gmean",j]<-sqrt(resul_pred_perf_folds["sen",j]*resul_pred_perf_folds["spec",j])
}
fold_res[["performance"]]<-resul_pred_perf_folds
resul_raw[alg_fact] <- predict(result_model, newdata=hold_out, type="raw")
resul_raw$Probability <- predict(result_model, newdata=hold_out, type="prob")[,2]
resul_pred_perf[1,1] <- AUC::auc(roc(resul_raw$Probability,hold_out[,as.character(TARGET0)]))
resul_pred_perf[2,1] <- caret::sensitivity(resul_raw[,alg_fact],hold_out[,as.character(TARGET0)])
resul_pred_perf[3,1] <- caret::specificity(resul_raw[,alg_fact],hold_out[,as.character(TARGET0)])
resul_pred_perf[4,1] <- (as.data.frame(confusionMatrix(resul_raw[,alg_fact], hold_out[,as.character(TARGET0[1])])$overall))[1,]
resul_pred_perf[5,1] <- sqrt(resul_pred_perf[2,1]*resul_pred_perf[3,1])
#}
gmean_overal<-as.data.frame(result_model$results)
gmean_folds<-as.data.frame(result_model$resample)
equat<-as.data.frame(result_model$finalModel$coefficients)
names(equat)<-"coef"
equat$vars<-rownames(equat)
equat$vars[which(equat$vars==c("(Intercept)"))]<-c("INTERCEPT")
max_gmean<-max(result_model$resample$gmean)
fold_pick<-result_model$resample[which(result_model$resample$gmean==max_gmean),c("Resample")]
foldbest<-result_model$pred[which(result_model$pred$Resample==fold_pick),c("obs","pred","Survival")]
names(foldbest)<-c(paste0("year",iii-1),"log","Probability")
dir.create(paste0(getwd(),"/data_pile"))
# the next line removes everything except for the mentioned data frames
rm(list=setdiff(ls(), c("resul_pred_perf","resul_raw","alg_fact","iii","coef","var_imp",
"gmean_overal","gmean_folds","equat","foldbest","fold_res")))
saveRDS(list(Performance=resul_pred_perf, Predicted=resul_raw,coef=coef,var_imp=var_imp,gmean_overal=gmean_overal,gmean_folds=gmean_folds,
equat=equat,foldbest=foldbest ),
paste0(getwd(),"/data_pile/",alg_fact,"-year-",iii-1,".rds"))
return(list(Performance=resul_pred_perf, Predicted=resul_raw,coef=coef,var_imp=var_imp,gmean_overal=gmean_overal,gmean_folds=gmean_folds,
equat=equat,foldbest=foldbest,folds_res=fold_res))
}else{
dir.create(paste0(getwd(),"/data_pile"))
rm(list=setdiff(ls(), c("experiment.summary","alg_fact","iii","coef","var_imp")))
save.image(paste0(getwd(),"/data_pile/NOT-",alg_fact,"-",iii,".RData"))
saveRDS(list(Performance="NOT", Predicted="NOT",coef=coef,var_imp=var_imp,gmean_overal="NOT",gmean_folds="NOT",
equat="NOT",foldbest="NOT"),
paste0(getwd(),"/data_pile/NOT-",alg_fact,"-year-",iii-1,".rds"))
return(list(Performance="NOT", Predicted="NOT",coef=coef,var_imp=var_imp,gmean_overal="NOT",gmean_folds="NOT",
equat="NOT",foldbest="NOT",folds_res="NO"))
}
}
##########################################################################
#identifying number of created clusters for parallel processing
if(Sys.info()[1]=="Windows"){
library(snow)
cl <- parallel::makeCluster(4)
}else{
library(Rmpi)
library(snow)
# assign cores used in the parallel computing
workers <- as.numeric(Sys.getenv(c("PBS_NP"))) - 1
s.no <- 11 # one parallel task per time point (years 0-10)
workers.no<-min(workers,s.no)
cl <- makeCluster(workers.no,"MPI")
}
time.begin <- proc.time()[3]
pred_LR <- parSapply(cl, 1:11,train_para2 ,t_data=train_data,h_data=holdOut_data,features=LASSO,
folds=5,resampl_meth="up",alg_used="glm",seed0=2019)
time.window <- proc.time()[3] - time.begin
stopCluster(cl)
if(Sys.info()[1]=="Windows"){saveRDS(pred_LR,paste0(data.path,"_pred_LR1124.rds"))}
if(Sys.info()[1]=="Linux"){saveRDS(pred_LR,paste0("/users/PMIU0138/miu0150/Transplant/data/_pred_LR1124.rds"))
mpi.quit()
}
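The caret configuration at the heart of train_para2() (5-fold cross-validation, up-sampling, class probabilities, and the custom geometric-mean metric) can be distilled into a short standalone sketch; the simulated data frame below is purely illustrative:

```r
# Distilled sketch of the caret setup in train_para2(), on simulated data
library(caret)
set.seed(2019)
n <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(runif(n) < plogis(2 * df$x1), "Survival", "Death"),
               levels = c("Death", "Survival"))
# geometric mean of sensitivity and specificity, as in the main pipeline
gmeanfunction <- function(data, lev = NULL, model = NULL) {
  sens <- caret::sensitivity(data$pred, data$obs)
  spec <- caret::specificity(data$pred, data$obs)
  c(gmean = sqrt(sens * spec))
}
ctrl <- trainControl(method = "cv", number = 5, sampling = "up",
                     classProbs = TRUE, savePredictions = "all",
                     summaryFunction = gmeanfunction)
fit <- train(y ~ ., data = df, method = "glm", family = "binomial",
             trControl = ctrl, metric = "gmean")
```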
The following code generates a report on the importance of the features.
data.path<-"C:/Users/hahadydolatsara/OneDrive - Clark University/Research topics/Transplantation/two-stage/data/"
LR_models<-readRDS(url("https://github.com/transplantation/unos-ht/raw/master/data/_pred_LR1124.rds"))
heart.df.cleaned_all.dum_names<-readRDS(url("https://github.com/transplantation/unos-ht/raw/master/data/heart.df.cleaned_all.dum_names.rds"))
exclud <- c(paste0("year",seq(0,10)), "ID")
vars<-heart.df.cleaned_all.dum_names[!heart.df.cleaned_all.dum_names %in% exclud]
varimp_years<-as.data.frame(matrix(0, ncol = 12, nrow = length(vars)))
colnames(varimp_years)<-c("vars",paste0("year",seq(0,10)))
varimp_years$vars<-vars
perf_years<-as.data.frame(matrix(0, ncol = 13, nrow = 5))
colnames(perf_years)<-c("measure",paste0("year",seq(0,10)),"all_mean")
perf_years$measure<-c("AUC","Sensitivity","Specificity","Accuracy","Geometric_Mean")
for (i in 0:10){
varimp_years[varimp_years$vars %in% LR_models[4,i+1]$var_imp$vars,paste0("year",i)]<-1
perf_years[paste0("year",i)]<-LR_models[1,i+1]$Performance
}
perf_years$all_mean<-rowMeans(perf_years[paste0("year",seq(0,10))])
varimp_years$no_years<-rowSums(varimp_years[paste0("year",seq(0,10))])
varimp_years<-varimp_years[which(varimp_years$no_years!=0),]
saveRDS(varimp_years,paste0(data.path,"_varimp_years.rds"))
The following code applies isotonic regression; the performance of the predictions before and after isotonic regression is then compared.
library(pacman) # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
p_load(caret, AUC)
logit2prob <- function(logit){
odds <- exp(logit)
prob <- odds / (1 + odds)
return(prob)
}
predictions<-list()
in_all_h<-LR_models[2,11]$Predicted$ID
all_preds<-as.data.frame(matrix(0, ncol = 12, nrow = length(in_all_h)))
all_TARGETS<-as.data.frame(matrix(0, ncol = 11, nrow = length(in_all_h)))
colnames(all_preds)<-c(paste0("year",seq(0,10)),"monotonic")
colnames(all_TARGETS)<-c(paste0("year",seq(0,10)))
all_TARGETS_pred<-all_TARGETS
for(i in 0:10){
predictions[[i+1]]<-LR_models[2,i+1]$Predicted[which(LR_models[2,i+1]$Predicted$ID %in%
in_all_h),]
if(sum(predictions[[i+1]]$ID==in_all_h)==length(in_all_h)){
all_preds[paste0("year",i)]<-predictions[[i+1]]$Probability
all_TARGETS[paste0("year",i)]<-predictions[[i+1]]$TARGET_raw[paste0("year",i)]
all_TARGETS_pred[paste0("year",i)]<-predictions[[i+1]]$glm
}
#monotonic
}
all_preds2<-all_preds
all_TARGETS_pred2<-all_TARGETS_pred
for(i in 1:nrow(all_preds)){
all_preds$monotonic[i]<-all(as.numeric(all_preds[i,paste0("year",seq(0,10))])==cummin(
as.numeric(all_preds[i,paste0("year",seq(0,10))])))
all_preds2[i,paste0("year",seq(0,10))]<-rev(isoreg(rev(as.numeric(all_preds[i,paste0("year",seq(0,10))])))$yf)
}
for(i in 1:nrow(all_preds2)){
for(j in 1:ncol(all_preds2[paste0("year",seq(0,10))])){
if(all_preds2[i,j]>=0.5){all_TARGETS_pred2[i,j]<-"Survival"}else{all_TARGETS_pred2[i,j]<-"Death"}
}
}
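The rev(isoreg(rev(...)))$yf pattern above is the heart of the monotonic correction: stats::isoreg() fits a non-decreasing step function, so reversing the yearly probabilities before the fit and reversing the fitted values back yields a non-increasing survival curve. A toy illustration with made-up probabilities:

```r
# Toy illustration of the monotonic correction (made-up probabilities)
raw <- c(0.95, 0.92, 0.93, 0.88, 0.90, 0.85, 0.80, 0.82, 0.78, 0.75, 0.76)
calibrated <- rev(isoreg(rev(raw))$yf)
all(diff(calibrated) <= 0) # TRUE: survival probabilities now decrease over time
```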
In the following, we evaluate the performance measures before and after isotonic regression is applied.
perf_before_iso<-as.data.frame(matrix(0, ncol = 13, nrow = 5))
colnames(perf_before_iso)<-c("Measures",paste0("year",seq(0,10)),"Mean")
perf_before_iso$Measures<-c("Accuracy","Specificity", "Sensitivity","AUC","G-Mean" )
perf_after_iso<-perf_before_iso
for(i in 0:10){
perf_before_iso[4,paste0("year",i)]<-(as.data.frame(confusionMatrix(all_TARGETS_pred[,paste0("year",i)],
all_TARGETS[,paste0("year",i)])$overall))[1,]
perf_before_iso[3,paste0("year",i)]<-caret::specificity(all_TARGETS_pred[,paste0("year",i)],all_TARGETS[,paste0("year",i)])
perf_before_iso[2,paste0("year",i)]<-caret::sensitivity(all_TARGETS_pred[,paste0("year",i)],all_TARGETS[,paste0("year",i)])
perf_before_iso[1,paste0("year",i)]<-AUC::auc(roc(all_preds[,paste0("year",i)],all_TARGETS[,paste0("year",i)]))
perf_before_iso[5,paste0("year",i)] <- sqrt(perf_before_iso[2,paste0("year",i)]*perf_before_iso[3,paste0("year",i)])
perf_after_iso[4,paste0("year",i)]<-(as.data.frame(confusionMatrix(all_TARGETS_pred2[,paste0("year",i)],
all_TARGETS[,paste0("year",i)])$overall))[1,]
perf_after_iso[3,paste0("year",i)]<-caret::specificity(all_TARGETS_pred2[,paste0("year",i)],all_TARGETS[,paste0("year",i)])
perf_after_iso[2,paste0("year",i)]<-caret::sensitivity(all_TARGETS_pred2[,paste0("year",i)],all_TARGETS[,paste0("year",i)])
perf_after_iso[1,paste0("year",i)]<-AUC::auc(roc(all_preds2[,paste0("year",i)],all_TARGETS[,paste0("year",i)]))
perf_after_iso[5,paste0("year",i)] <- sqrt(perf_after_iso[2,paste0("year",i)]*perf_after_iso[3,paste0("year",i)])
}
perf_before_iso$Mean<-rowMeans(perf_before_iso[paste0("year",seq(0,10))])
perf_after_iso$Mean<-rowMeans(perf_after_iso[paste0("year",seq(0,10))])
# calculating performances of all training folds
auc_folds<-as.data.frame(matrix(0, ncol = 5, nrow = 11))
colnames(auc_folds)<-c(paste0("auc-fold",seq(1,5)))
rownames(auc_folds)<-c(paste0("year",seq(0,10)))
accuracy_folds<-auc_folds
colnames(accuracy_folds)<-c(paste0("accuracy-fold",seq(1,5)))
sensitivity_folds<-auc_folds
colnames(sensitivity_folds)<-c(paste0("sensitivity-fold",seq(1,5)))
specificity_folds<-auc_folds
colnames(specificity_folds)<-c(paste0("specificity-fold",seq(1,5)))
gmean_folds<-auc_folds
colnames(gmean_folds)<-c(paste0("gmean-fold",seq(1,5)))
for (i in 0:10) {
for(j in 1:5){
auc_folds[paste0("year",i),paste0("auc-fold",j)]<-LR_models[9,i+1]$folds_res[["performance"]][1,j]
sensitivity_folds[paste0("year",i),paste0("sensitivity-fold",j)]<-LR_models[9,i+1]$folds_res[["performance"]][2,j]
specificity_folds[paste0("year",i),paste0("specificity-fold",j)]<-LR_models[9,i+1]$folds_res[["performance"]][3,j]
accuracy_folds[paste0("year",i),paste0("accuracy-fold",j)]<-LR_models[9,i+1]$folds_res[["performance"]][4,j]
gmean_folds[paste0("year",i),paste0("gmean-fold",j)]<-LR_models[9,i+1]$folds_res[["performance"]][5,j]
}
}
Here, you can find the performance of the predictions before/after isotonic regression for the patients whose data are available at all time points. First, the results before isotonic regression are presented.
before_iso<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/before-isotonicregression.csv")
DT::datatable(before_iso)
Next, the results after isotonic regression are presented.
after_iso<-read.csv("https://raw.githubusercontent.com/transplantation/unos-ht/master/data/rmarkdown/two_stage/after-isotonicregression.csv")
DT::datatable(after_iso)
In this section, we report the obtained results and examine how good our models are by checking the survival probabilities before and after applying isotonic regression.
We have created an interactive app that presents two modules for performing the analysis:
Manual Entry, where users can insert the values of predictor variables using several text boxes.
CSV Entry, where users can upload the values of predictor variables using a comma-separated values (CSV) file.
Visit the app: here.
Auburn University, hamid@auburn.edu↩
University of Dayton, ychen4@udayton.edu↩
Auburn University, cje0010@auburn.edu↩
Auburn University, azg0074@auburn.edu↩
Miami University, fmegahed@miamioh.edu↩